CN115546273A - Scene structure depth estimation method for indoor fisheye image - Google Patents

Scene structure depth estimation method for indoor fisheye image

Info

Publication number
CN115546273A
Authority
CN
China
Prior art keywords
depth
image
fisheye image
scene structure
fisheye
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211397138.0A
Other languages
Chinese (zh)
Inventor
孟明
肖立凯
周忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN202211397138.0A priority Critical patent/CN115546273A/en
Publication of CN115546273A publication Critical patent/CN115546273A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a scene structure depth estimation method for indoor fisheye images, which comprises the following steps: (1) designing a feature-based target loss function; (2) designing a distortion perception module based on fisheye projection; (3) constructing a scene structure depth estimation network model based on an encoding-decoding strategy and setting the corresponding model training parameters; (4) training and optimizing the model on a training dataset of fisheye image depth; (5) feeding a test dataset of fisheye image depth into the trained model and predicting the corresponding scene structure depth maps. Given an RGB fisheye image and the corresponding mask image as input, the method achieves end-to-end estimation of scene structure depth from a single fisheye image for the first time; it can be widely applied to virtual/augmented reality and indoor robot navigation, and improves the speed and accuracy of three-dimensional reconstruction and three-dimensional scene understanding.

Description

Scene structure depth estimation method for indoor fisheye image
Technical Field
The invention relates to the technical field of indoor scene understanding, and in particular to a scene structure depth estimation method for indoor fisheye images.
Background
Depth estimation from a single RGB image is a fundamental task of visual research and is widely used in visual tasks such as indoor robot navigation, three-dimensional map reconstruction and three-dimensional scene understanding. Indoor scene depth estimation estimates, in the form of a depth map, the distance between indoor objects and the capturing camera and determines the objects' three-dimensional positions; the objects include movable items such as beds, tables and stools, as well as fixed, immovable structures such as the floor, walls and ceiling. Indoor scene depth estimation from a single RGB perspective image has made remarkable progress, and the adopted methods fall roughly into two categories. One category comprises traditional machine learning algorithms, including Markov random fields (MRFs) and conditional random fields (CRFs). The other is the mainstream deep learning approach, which learns the mapping between each image pixel and the corresponding depth value through a convolutional neural network, with network models such as FCRN, OmniDepth, LDPN and BiFuse. To capture more information of a scene, more and more portable, consumer-grade fisheye cameras are appearing, and research on fisheye image depth estimation is therefore receiving considerable attention.
Extracting depth information from an omnidirectional image is similar to depth estimation from perspective views, but the distortion in omnidirectional images makes it difficult to apply perspective depth estimation methods directly. The initial solution was to slice the omnidirectional image, transfer a perspective depth estimation method to each slice to estimate its depth values, and then stitch the perspective depth maps into the corresponding omnidirectional depth map. However, this type of method still does not fully exploit global context information and is time-consuming and inefficient. To balance distortion and viewing angle in the omnidirectional setting, researchers have considered projection fusion, jointly training on the omnidirectional projection and the stereoscopic projection of the omnidirectional image, thereby alleviating the distortion problem in depth estimation.
Disclosure of Invention
The technical problem solved by the invention: a scene structure depth estimation method for indoor fisheye images is provided, aiming at the severe geometric distortion of existing fisheye images, the low accuracy of existing depth estimation models, and the blurred depth estimates at object edges.
The technical solution of the invention is as follows: a scene structure depth estimation method for indoor fisheye images, comprising the following steps:
(1) Constructing a scene structure depth estimation network model based on an encoding-decoding strategy and setting the training parameters of the network model; a fisheye distortion convolution module is adopted in the encoder, which uses deformable convolution to learn the geometric distortion information in the fisheye image and performs local convolution operations on the fisheye image, improving the accuracy of feature extraction from the fisheye image; an upward mapping layer module is adopted in the decoder to deepen the network structure; meanwhile, skip connections are added between the encoder and the decoder to improve the scene structure depth estimation accuracy of the network model; an image-feature-based target loss function is adopted in the network model training process;
(2) Training and optimizing the scene structure depth estimation network model on a training dataset of fisheye image depth;
(3) Feeding the test dataset of fisheye image depth into the trained scene structure depth estimation network model and predicting the scene structure depth of the input fisheye image.
Further, in step (1), the image-feature-based target loss function L is:

L = ω1·L_depth + ω2·L_grad + ω3·L_normal

L_depth is the depth loss term designed according to the distance characteristics of objects in the image, computed as:

L_depth = (1/N) · Σ_i ln(|d_i − g_i| + α)

where N is the number of samples (valid pixels), d_i denotes the predicted structure depth value, g_i the ground-truth structure depth value, α is an adjustable parameter empirically set to 0.5, and i is the pixel index.

L_grad is the gradient loss term designed according to the edge gradient structure characteristics of the image, computed as:

L_grad = (1/N) · Σ_i [ ln(|∇x(d_i − g_i)| + α) + ln(|∇y(d_i − g_i)| + α) ]

where ∇x and ∇y denote the edge gradient components, i.e. the partial derivatives of the depth error in the horizontal (x) and vertical (y) directions.

L_normal is the normal vector loss term designed according to the shape characteristics of objects in the image; it uses surface normal constraints to improve the estimation accuracy of the scene structure depth map in both global and local detail, and is computed as:

L_normal = (1/N) · Σ_i ( 1 − ⟨n_i^d, n_i^g⟩ / (‖n_i^d‖ · ‖n_i^g‖) )

where n_i^d and n_i^g denote the surface normal vectors computed from the predicted and ground-truth structure depth maps respectively, and ⟨·,·⟩ denotes the inner product of the predicted and ground-truth normal vectors.

ω1, ω2 and ω3 are the weight coefficients of the three loss terms. In the invention all three weight coefficients are set to 1, i.e. each term carries equal weight.
Further, in step (1), when constructing the scene structure depth estimation network model based on the encoding-decoding strategy, the encoder uses ResNet-50 as the backbone network to extract semantic features of the input fisheye image, learns the dependency relationships between image pixels, and outputs a feature map containing both low-dimensional and high-dimensional semantic information; the fisheye distortion convolution module is adopted in the third to fifth bottleneck blocks of ResNet-50, which enhances the ability of the scene structure depth estimation model to learn fisheye image distortion; the fisheye distortion convolution module is designed according to the fisheye image projection model.
Further, in step (1), when constructing the scene structure depth estimation network model based on the encoding-decoding strategy, the decoder is implemented as follows:
The feature map obtained by the encoder is taken as input, and feature decoding is built on the upward mapping layer module; the decoder comprises four upward mapping layer modules, which increase the resolution of the feature map and decode the semantic features, map the learned distributed feature representation to the sample label space through supervised end-to-end learning, and output the predicted structure depth map; each upward mapping layer module adopts a residual structure design.
Compared with the prior art, the invention has the advantages that:
(1) To address the severe geometric distortion of existing fisheye images, the blurred depth at object edges, and the poor depth estimation of existing scene structure depth estimation models, the invention collects the omnidirectional feature information of the fisheye image and weakens distortion interference through a fisheye distortion convolution module. In addition, an upward mapping layer module is designed based on the residual idea, and a structure depth estimation network with deeper layers and higher accuracy is built from it; meanwhile, a gradient-based target loss function is proposed and used to train the depth network end to end, achieving a better depth estimation effect on fisheye images.
(2) The design of the loss function plays an important role in the training of a neural network. Depth estimation is a regression task in machine learning, and the loss functions commonly used in regression tasks include the mean square error and the absolute value loss. The absolute value loss is not continuously differentiable, while the mean square loss is sensitive to outliers, which may hamper training. Therefore, the method guides the training of the network with an image-feature-based loss function, which better estimates depth information at object edges and improves the accuracy of the depth estimation network through end-to-end supervised learning.
(3) The sampling locations of a conventional convolution kernel are fixed: a convolution kernel can only sample at fixed positions of the input feature map, and its receptive field is always rectangular, so a convolutional neural network is poor at modeling geometric transformations. The fisheye images processed by the method contain objects of different shapes and sizes, and the nonlinear orthographic projection model of the fisheye camera introduces severe distortion into the image.
(4) Scene structure depth estimation for fisheye images has been little studied, and because of the large field of view and distortion of fisheye images, existing depth estimation networks perform poorly on fisheye datasets; a structure depth estimation network oriented to fisheye images is therefore designed. The whole network follows an encoder-decoder strategy: the fisheye image and the corresponding mask image are taken as network input, image features are extracted by the encoder, the dependency relationships between image pixels are learned, and a feature map containing both low-dimensional and high-dimensional semantic information is output. This feature map is taken as input to a decoder built on the upward mapping layer module, which decodes the features, maps the learned distributed feature representation to the sample label space through supervised end-to-end learning, and outputs the predicted depth map. To fully utilize feature maps of different scales, a skip connection structure is introduced between the encoder and the decoder, further improving the accuracy of depth estimation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is the overall flow diagram of the scene structure depth estimation for indoor fisheye images according to the present invention;
FIG. 2 is a structural diagram of the upward mapping layer module according to the present invention;
FIG. 3 is a diagram of the end-to-end neural network architecture of the present invention;
FIG. 4 is a schematic diagram of the network input and prediction according to the present invention: (a) is the RGB fisheye image input to the network, (b) is the corresponding mask image input to the network, and (c) is the depth map predicted by the network.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and based on the embodiments of the present invention, all other embodiments obtained by a person skilled in the art without creative efforts belong to the protection scope of the present invention.
As shown in fig. 1, the method of the present invention is implemented as follows:
1. Designing the feature-based target loss function
a) The target loss function is designed from loss terms corresponding to different image features and mainly comprises a depth loss term, a gradient loss term and a normal vector loss term.
b) The most common characteristic of an actual scene image is the difference in object distances. If this characteristic is ignored and the depth loss term L_depth is computed directly from equally weighted depth error values, the depth estimation result becomes blurred. The designed depth loss term is:

L_depth = (1/N) · Σ_i ln(|d_i − g_i| + α)

c) Apart from the distance differences, the most prominent features of an actual scene image are the structural features of multi-step edges. The depth loss term above only balances the depth transformation direction and has difficulty handling the different offsets occurring in the depth direction. To further optimize the depth information of different gradients at edges, the designed gradient loss term is:

L_grad = (1/N) · Σ_i [ ln(|∇x(d_i − g_i)| + α) + ln(|∇y(d_i − g_i)| + α) ]

d) A depth map, as one representation of the three-dimensional model of a scene, can be expressed as a finite number of smooth surfaces and the step edges between them. Although the gradient loss above can optimize different depths at edges, it has difficulty effectively handling shape features of the scene, such as the common steep edges, corners and planar structures. A normal vector encodes the angle information of an object surface: a uniform normal vector can impose a global constraint on planar features, while the angle information within planar features can effectively constrain local structural features. The designed normal vector loss term is:

L_normal = (1/N) · Σ_i ( 1 − ⟨n_i^d, n_i^g⟩ / (‖n_i^d‖ · ‖n_i^g‖) )

e) Combining the above loss terms, the target loss function for omnidirectional image structure depth estimation is defined as:

L = ω1·L_depth + ω2·L_grad + ω3·L_normal

The parameters of the depth estimation network are optimized with the constructed image-feature-based target loss function, yielding more accurate depth prediction results; an illustrative sketch of this loss is given below.
2. Designing the fisheye distortion convolution module
a) First, a regular grid R is defined for regular sampling of the input feature map f_l, where the grid R is determined by the convolution kernel size and dilation. Setting R = {(-1,-1), (-1,0), ..., (0,1), (1,1)} indicates that the receptive field is 3 × 3 and the dilation is 1. During traversal, the corresponding output feature map f_{l+1} is computed at each position through a weighted operation; the output mapping is:

f_{l+1}(p_0) = Σ_{p_n ∈ R} ω(p_n) · f_l(p_0 + p_n)

where f_{l+1}(p_0) denotes the point p_0 in the output feature map, obtained by summing the products of the sampled values in f_l and the weights ω; f_l(p_0 + p_n) denotes the pixel p_0 + p_n in the input feature map, and p_n is the n-th sampling point on the grid R spreading outward from the center f_l(p_0), where (-1,-1) of the grid R corresponds to the upper-left of f_l(p_0) and (1,1) to its lower-right. This computation shows that standard convolution sampling in a CNN is always fixed, the receptive field of each sampling point is an inherent rectangular structure, high-level semantic information is difficult to handle flexibly, and a convolutional neural network using standard convolution has a poor ability to model geometric variation.
b) On the basis of standard convolution, the invention extracts the effective area of the fisheye image through preprocessing to keep the context consistent, samples the corresponding irregular grid from the fisheye image, and computes the positions of the distorted pixels of the original grid according to the fisheye distortion projection model.
c) The invention introduces the designed fisheye distortion convolution module into the encoder of the network structure, as shown in fig. 3, to learn distortion information at different positions and of different degrees and to extract discriminative features.
The fisheye distortion convolution module is implemented as follows (an illustrative sketch is given after the list):
(1) When computing the fisheye distortion convolution, the effective area of the fisheye image is extracted through preprocessing to keep the context consistent, the corresponding irregular grid is sampled from the fisheye image, and the distorted pixel positions are computed from the original grid as:

f_{l+1}(p_0) = Σ_{p_n ∈ R} ω(p_n) · f_l(p_0 + p_n + Δp_n)

where Δp_n is the offset of p_n computed by the fisheye distortion projection model, and p_0 = (u(p_0), v(p_0)) denotes the pixel position in f_{l+1}.
(2) To compute the offsets, the longitude θ(p_0) and latitude φ(p_0) of p_0 in the spherical coordinate system are first computed from its pixel coordinates, where W is the image width and u and v denote the horizontal and vertical coordinates of the pixel position, respectively.
(3) The rotation matrix T is computed with the Euler-Rodrigues rotation formula as T = R_y(θ(p_0)) · R_x(φ(p_0)), where R_x(φ(p_0)) first rotates the convolution kernel counterclockwise about the x-axis by φ(p_0), and R_y(θ(p_0)) then rotates it counterclockwise about the y-axis by θ(p_0).
(4) Any point p_n on the convolution kernel is rotated by the matrix T as p'_n = T · p_n, where p_n = [i, j, d], i ∈ [−k_w/2, k_w/2], j ∈ [−k_h/2, k_h/2], and k_w and k_h are the width and height of the convolution kernel.
(5) d is the distance from the kernel grid R to the center of the unit sphere and is computed from the field of view and the size of the convolution kernel.
(6) The three-dimensional coordinates of the rotated convolution kernel points are then mapped to the corresponding longitude and latitude coordinates θ(p'_n) and φ(p'_n).
(7) The transformed longitude and latitude coordinates are projected to the corresponding pixel coordinates (u(p'_n), v(p'_n)) in the fisheye image.
(8) The offset Δp'_n is computed as:

u(Δp'_n) = u(p'_n) − u(p_n)

v(Δp'_n) = v(p'_n) − v(p_n)
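The following sketch illustrates how such per-position offsets can be precomputed and applied through a deformable convolution. It is only an illustration under explicit assumptions: since the publication gives the explicit projection formulas as figures, an equidistant fisheye model (r = f·ψ) and a tangent-plane kernel at unit distance (d = 1) are assumed here, PyTorch with torchvision's deform_conv2d is used as the deformable-convolution backend, and the names fisheye_kernel_offsets and FisheyeDistortionConv2d are illustrative rather than taken from the patent.

```python
import math
import torch
from torchvision.ops import deform_conv2d


def fisheye_kernel_offsets(height, width, k=3, fov=math.pi):
    """Precompute offsets of shape (2*k*k, H, W) for one feature-map size.

    Sketch only: assumes an equidistant fisheye model (r = f * psi) and a
    kernel placed on the tangent plane at unit distance from the sphere center.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    f = min(cx, cy) / (fov / 2.0)                        # equidistant focal length
    rng = torch.arange(k, dtype=torch.float32) - (k - 1) / 2.0
    jj, ii = torch.meshgrid(rng, rng, indexing="ij")     # regular kernel grid R, row-major
    step = fov / max(height, width)                      # assumed angular spacing per kernel cell
    base = torch.stack([ii * step, jj * step, torch.ones_like(ii)], dim=-1).reshape(-1, 3)

    offsets = torch.zeros(2 * k * k, height, width)
    for v in range(height):
        for u in range(width):
            # Step (2): longitude/latitude of p0 under the assumed fisheye model.
            du, dv = u - cx, v - cy
            psi = min(math.hypot(du, dv) / f, fov / 2.0)
            az = math.atan2(dv, du)
            x, y, z = math.sin(psi) * math.cos(az), math.sin(psi) * math.sin(az), math.cos(psi)
            lon, lat = math.atan2(x, z), math.asin(max(-1.0, min(1.0, y)))
            # Step (3): T = R_y(lon) * R_x(lat); the sign of the x-rotation is chosen
            # so the rotated kernel centre lands on this pixel's viewing ray.
            a = -lat
            rx = torch.tensor([[1.0, 0.0, 0.0],
                               [0.0, math.cos(a), -math.sin(a)],
                               [0.0, math.sin(a), math.cos(a)]])
            ry = torch.tensor([[math.cos(lon), 0.0, math.sin(lon)],
                               [0.0, 1.0, 0.0],
                               [-math.sin(lon), 0.0, math.cos(lon)]])
            t = ry @ rx
            # Step (4): rotate all kernel points; steps (6)-(7): project them back
            # to longitude/latitude and then to fisheye pixel coordinates.
            rot = base @ t.T
            rot = rot / rot.norm(dim=1, keepdim=True)
            psi_n = torch.acos(rot[:, 2].clamp(-1.0, 1.0))
            az_n = torch.atan2(rot[:, 1], rot[:, 0])
            un = cx + f * psi_n * torch.cos(az_n)
            vn = cy + f * psi_n * torch.sin(az_n)
            # Step (8): offset = distorted position minus regular grid position,
            # stored as (dy, dx) pairs as expected by deform_conv2d.
            off = torch.stack([vn - (v + jj.reshape(-1)), un - (u + ii.reshape(-1))], dim=1)
            offsets[:, v, u] = off.reshape(-1)
    return offsets


class FisheyeDistortionConv2d(torch.nn.Module):
    """Deformable convolution driven by precomputed fisheye distortion offsets."""

    def __init__(self, in_ch, out_ch, height, width, k=3, fov=math.pi):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.bias = torch.nn.Parameter(torch.zeros(out_ch))
        self.padding = k // 2
        self.register_buffer("offsets", fisheye_kernel_offsets(height, width, k, fov))

    def forward(self, x):
        off = self.offsets.unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
        return deform_conv2d(x, off, self.weight, self.bias, padding=self.padding)
```

In use, one such module (with offsets precomputed for the corresponding feature-map resolution) would stand in for a standard 3 × 3 convolution in the encoder's later bottleneck blocks.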
3. Scene structure depth estimation network model based on the encoding-decoding strategy
(1) An effective fisheye image structure depth estimation network is designed with an encoding-decoding strategy; the overall network structure is shown in fig. 3.
(2) The input of the whole network structure consists of two parts: the fisheye image and the corresponding mask image are taken as network input, shown as (a) and (b) in fig. 4, respectively. In the mask map, all pixel values belonging to movable objects are set to 0 and appear black, while the pixel values of the other structural regions are set to 255 and appear white, as a bitmap. When the mask map is added to guide structure depth estimation, two different methods are adopted: the mask map is either multiplied element-wise with the RGB image and fed into the encoding structure, or directly connected to the decoding structure. Through the designed encoding-decoding structure, the scene structure depth map with movable objects removed is estimated.
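A minimal sketch of the first option (the mask multiplied element-wise with the RGB image before encoding); the 0/255 value convention follows the description, while the tensor layout and function name are assumptions of this example:

```python
import torch

def mask_guided_input(rgb, mask):
    """rgb: (B, 3, H, W) fisheye image; mask: (B, 1, H, W), 0 = movable object, 255 = structure."""
    m = (mask > 0).float()      # 1 for structural regions, 0 for movable objects
    return rgb * m              # movable-object pixels are zeroed before entering the encoder
```

The second option would instead feed m (suitably resized) into the decoding structure.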
(3) The backbone of the encoder is still ResNet-50 with the fully connected layer removed; the fisheye distortion convolution module is introduced into the final convolution layers of ResNet-50, which alleviates the inefficient feature learning caused by omnidirectional geometric distortion, learns the dependency relationships between image pixels, outputs a feature map containing both low-dimensional and high-dimensional semantic information, and improves the geometric-transformation modeling capability of the structure depth estimation.
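A hedged sketch of such an encoder: torchvision's ResNet-50 with the fully connected layer dropped, returning the intermediate feature maps later needed for the skip connections. Where exactly the fisheye distortion convolution replaces the standard convolutions is indicated only by a comment, since this sketch does not fix those details.

```python
import torch
import torchvision


class FisheyeEncoder(torch.nn.Module):
    """ResNet-50 feature extractor with the fully connected layer removed."""

    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2 = r.layer1, r.layer2
        # In the described model, standard convolutions in the later bottleneck
        # blocks would be replaced by the fisheye distortion convolution module.
        self.layer3, self.layer4 = r.layer3, r.layer4

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)       # 1/4 resolution, low-level semantic information
        f2 = self.layer2(f1)      # 1/8
        f3 = self.layer3(f2)      # 1/16
        f4 = self.layer4(f3)      # 1/32, high-level semantic information
        return f1, f2, f3, f4     # multi-scale features for the skip connections
```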
(4) The decoder consists of four upward mapping modules and a 3 × 3 convolution layer; its main purpose is to restore the feature map resolution to the original image size and to decode the semantic features obtained during encoding, where bilinear interpolation is used for upsampling to increase the feature map resolution. On this basis, an upward mapping layer module based on a residual structure is designed, as shown in fig. 2, which further increases the depth of the network structure, avoids the problems of vanishing gradients and overfitting, and improves the learning capability of the model.
As shown in fig. 2, the upward mapping layer module designed based on the residual structure comprises the following (a minimal implementation sketch is given after this list):
(a) The resolution of the module's input feature map is H × W, where H is the height and W the width of the feature map;
(b) Upsampling is performed by bilinear interpolation, which expands the feature map resolution to twice the original and avoids the aliasing (sawtooth) artifacts caused by discontinuous gray values in the enlarged feature map;
(c) The upper and lower branches each apply a 5 × 5 convolution layer for smoothing;
(d) The upper branch then applies a 3 × 3 convolution layer to process the feature information; the output feature maps of the two branches are added, and the result is activated by a ReLU and serves as the output of the upward mapping layer module.
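A minimal PyTorch sketch of this module follows. The padding values (2 for the 5 × 5 convolutions, 1 for the 3 × 3 convolution) are chosen so the resolution is preserved, and the channel handling is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F


class UpMapping(torch.nn.Module):
    """Residual upward mapping layer: x2 bilinear upsampling plus two convolution branches."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.upper_5x5 = torch.nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.upper_3x3 = torch.nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.lower_5x5 = torch.nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        # (b) bilinear upsampling doubles the feature-map resolution (H x W -> 2H x 2W)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        upper = self.upper_3x3(self.upper_5x5(x))   # (c)+(d) upper branch: 5x5 then 3x3
        lower = self.lower_5x5(x)                   # (c) lower branch: 5x5 smoothing
        return F.relu(upper + lower)                # (d) add the two branches, then ReLU
```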
(5) In order to fully utilize the omnidirectional semantic information in feature maps of different scales, the multi-scale features of the encoder and decoder are fused through skip connections, further improving the accuracy of the network's structure depth estimation; the learned distributed feature representation is mapped to the sample label space through supervised end-to-end learning, and the predicted depth map is output, as shown in fig. 4(c).
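Putting the pieces together, a hedged sketch of the decoder path reusing the UpMapping module above: four upward mapping blocks restore the resolution, encoder features are fused through skip connections, and a final 3 × 3 convolution outputs the depth map. How the skips are fused is not fixed by the description; element-wise addition after 1 × 1 projection convolutions, and a final bilinear upsampling to the input size, are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F


class FisheyeDepthDecoder(torch.nn.Module):
    """Decoder: four residual upward mapping blocks with encoder skip connections."""

    def __init__(self, enc_ch=(256, 512, 1024, 2048), dec_ch=(1024, 512, 256, 128)):
        super().__init__()
        self.ups = torch.nn.ModuleList([
            UpMapping(enc_ch[3], dec_ch[0]),
            UpMapping(dec_ch[0], dec_ch[1]),
            UpMapping(dec_ch[1], dec_ch[2]),
            UpMapping(dec_ch[2], dec_ch[3]),
        ])
        # 1x1 projections so encoder skip features match decoder channel widths.
        self.skips = torch.nn.ModuleList([
            torch.nn.Conv2d(enc_ch[2], dec_ch[0], 1),
            torch.nn.Conv2d(enc_ch[1], dec_ch[1], 1),
            torch.nn.Conv2d(enc_ch[0], dec_ch[2], 1),
        ])
        self.head = torch.nn.Conv2d(dec_ch[3], 1, kernel_size=3, padding=1)

    def forward(self, f1, f2, f3, f4):
        x = self.ups[0](f4) + self.skips[0](f3)     # 1/16 resolution
        x = self.ups[1](x) + self.skips[1](f2)      # 1/8
        x = self.ups[2](x) + self.skips[2](f1)      # 1/4
        x = self.ups[3](x)                          # 1/2
        depth = self.head(x)                        # predicted structure depth map
        return F.interpolate(depth, scale_factor=2, mode="bilinear", align_corners=False)
```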

Claims (4)

1. A scene structure depth estimation method for indoor fisheye images, characterized by comprising the following steps:
(1) Constructing a scene structure depth estimation network model based on an encoding-decoding strategy and setting the training parameters of the network model; a fisheye distortion convolution module is adopted in the encoder, which uses deformable convolution to learn the geometric distortion information in the fisheye image and performs local geometric-feature convolution operations on the fisheye image, improving the accuracy of feature extraction from the fisheye image; an upward mapping layer module is adopted in the decoder to deepen the network structure; meanwhile, skip connections are added between the encoder and the decoder to improve the scene structure depth estimation accuracy of the network model; an image-feature-based target loss function is adopted in the network model training process;
(2) Training and optimizing the scene structure depth estimation network model on a training dataset of fisheye image depth;
(3) Feeding the test dataset of fisheye image depth into the trained scene structure depth estimation network model and predicting the scene structure depth of the input fisheye image.
2. The scene structure depth estimation method for indoor fisheye images according to claim 1, characterized in that: in step (1), the image-feature-based target loss function L is:

L = ω1·L_depth + ω2·L_grad + ω3·L_normal

L_depth is the depth loss term, computed as:

L_depth = (1/N) · Σ_i ln(|d_i − g_i| + α)

where N is the number of samples, d_i denotes the predicted structure depth value, g_i the ground-truth structure depth value, α is an adjustable parameter, and i is the pixel index;

L_grad is the gradient loss term, computed as:

L_grad = (1/N) · Σ_i [ ln(|∇x(d_i − g_i)| + α) + ln(|∇y(d_i − g_i)| + α) ]

where ∇x and ∇y denote the edge gradient components, i.e. the partial derivatives of the depth error in the x (horizontal) and y (vertical) directions;

L_normal is the normal vector loss term, computed as:

L_normal = (1/N) · Σ_i ( 1 − ⟨n_i^d, n_i^g⟩ / (‖n_i^d‖ · ‖n_i^g‖) )

where n_i^d and n_i^g denote the normal vectors computed from the predicted and ground-truth structure depth maps respectively, and ⟨·,·⟩ denotes the inner product of the predicted and ground-truth normal vectors; ω1, ω2 and ω3 are the weight coefficients of the three loss terms.
3. The scene structure depth estimation method for indoor fisheye images according to claim 1, characterized in that: in step (1), when constructing the scene structure depth estimation network model based on the encoding-decoding strategy, the encoder uses ResNet-50 as the backbone network to extract semantic features of the input fisheye image, learns the dependency relationships between image pixels, and outputs a feature map containing both low-dimensional and high-dimensional semantic information; the fisheye distortion convolution module is adopted in the third to fifth bottleneck blocks of ResNet-50, which enhances the ability of the scene structure depth estimation model to learn fisheye image distortion; the fisheye distortion convolution module is designed according to the fisheye image projection model.
4. The scene structure depth estimation method for indoor fisheye images according to claim 1, characterized in that: in step (1), when constructing the scene structure depth estimation network model based on the encoding-decoding strategy, the decoder is implemented as follows:
the feature map obtained by the encoder is taken as input, and feature decoding is built on the upward mapping layer module; the decoder comprises four upward mapping layer modules, which increase the resolution of the feature map and decode the semantic features, map the learned distributed feature representation to the sample label space through supervised end-to-end learning, and output the predicted structure depth map; each upward mapping layer module adopts a residual structure design.
CN202211397138.0A 2022-11-09 2022-11-09 Scene structure depth estimation method for indoor fisheye image Pending CN115546273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211397138.0A CN115546273A (en) 2022-11-09 2022-11-09 Scene structure depth estimation method for indoor fisheye image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211397138.0A CN115546273A (en) 2022-11-09 2022-11-09 Scene structure depth estimation method for indoor fisheye image

Publications (1)

Publication Number Publication Date
CN115546273A true CN115546273A (en) 2022-12-30

Family

ID=84719786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211397138.0A Pending CN115546273A (en) 2022-11-09 2022-11-09 Scene structure depth estimation method for indoor fisheye image

Country Status (1)

Country Link
CN (1) CN115546273A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053441A (en) * 2020-10-14 2020-12-08 北京大视景科技有限公司 Full-automatic layout recovery method for indoor fisheye image
CN115063463A (en) * 2022-06-20 2022-09-16 东南大学 Fish-eye camera scene depth estimation method based on unsupervised learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
IRO LAINA ET AL.: "Deeper Depth Prediction with Fully Convolutional Residual Networks", arXiv, 19 September 2016 (2016-09-19), pages 1-12, XP055759673, DOI: 10.1109/3DV.2016.32 *
JUNJIE HU ET AL.: "Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps with Accurate Object Boundaries", arXiv, 22 September 2018 (2018-09-22), pages 1-9 *
MING MENG ET AL.: "Distortion-Aware Room Layout Estimation from A Single Fisheye Image", 2021 IEEE International Symposium on Mixed and Augmented Reality, pages 441-449 *
JIANG ZHONGZE: "Research on pixel-level scene understanding technology based on fully convolutional networks", China Master's Theses Full-text Database, Information Science and Technology, no. 05, pages 22-26 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152491A (en) * 2023-01-03 2023-05-23 北京海天瑞声科技股份有限公司 Semantic segmentation method, semantic segmentation device and storage medium
CN116152491B (en) * 2023-01-03 2023-12-26 北京海天瑞声科技股份有限公司 Semantic segmentation method, semantic segmentation device and storage medium
CN117036613A (en) * 2023-08-18 2023-11-10 武汉大学 Polarization three-dimensional reconstruction method and system based on multiple receptive field blending network
CN117036613B (en) * 2023-08-18 2024-04-02 武汉大学 Polarization three-dimensional reconstruction method and system based on multiple receptive field blending network

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination