CN115546273A - Scene structure depth estimation method for indoor fisheye image - Google Patents

Scene structure depth estimation method for indoor fisheye image

Info

Publication number
CN115546273A
Authority
CN
China
Prior art keywords
depth
image
fisheye image
scene structure
fisheye
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211397138.0A
Other languages
Chinese (zh)
Inventor
孟明
肖立凯
周忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN202211397138.0A priority Critical patent/CN115546273A/en
Publication of CN115546273A publication Critical patent/CN115546273A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a scene structure depth estimation method for indoor fisheye images, which comprises the following steps: (1) designing a feature-based target loss function; (2) designing a distortion perception module based on fisheye projection; (3) constructing a scene structure depth estimation network model based on an encoding-decoding strategy and setting the corresponding model training parameters; (4) training and optimizing the model on a training dataset of fisheye image depth; (5) feeding a test dataset of fisheye image depth into the trained model and predicting the corresponding scene structure depth maps. Given an RGB fisheye image and the corresponding mask image as input, the method achieves end-to-end estimation of scene structure depth from a single fisheye image for the first time; it can be widely applied to virtual/augmented reality and indoor robot navigation, and improves the speed and accuracy of three-dimensional reconstruction and three-dimensional scene understanding.

Description

Scene structure depth estimation method for indoor fisheye image
Technical Field
The invention relates to the technical field of indoor scene understanding, and in particular to a scene structure depth estimation method for indoor fisheye images.
Background
Depth estimation from a single RGB image is a fundamental task of visual research and is widely used in visual tasks such as indoor robot navigation, three-dimensional map reconstruction and three-dimensional scene understanding. Indoor scene depth estimation estimates, in the form of a depth map, the distance between indoor objects and the capturing camera and determines the objects' three-dimensional positions; the objects include movable items such as beds, tables and stools, as well as fixed, immovable structures such as the floor, walls and ceiling. Indoor scene depth estimation from a single RGB perspective image has made remarkable progress, and the adopted methods fall roughly into two categories. One category comprises traditional machine learning algorithms, including Markov random fields (MRFs) and conditional random fields (CRFs). The other is the mainstream deep learning approach, which learns the mapping between each image pixel and the corresponding depth value through a convolutional neural network, with network models such as FCRN, OmniDepth, LDPN and BiFuse. To capture more information of a scene, more and more portable, consumer-grade fisheye cameras are appearing, and research on fisheye image depth estimation is therefore receiving considerable attention.
Extracting depth information from an omnidirectional image is similar to depth estimation from perspective views, but the distortion in omnidirectional images makes it difficult to apply perspective depth estimation methods directly. The initial solution was to slice the omnidirectional image, transfer a perspective depth estimation method to each slice to estimate its depth values, and then stitch the perspective depth maps into the corresponding omnidirectional depth map. However, this type of method still does not fully exploit global context information and is time-consuming and inefficient. To balance distortion and viewing angle in the omnidirectional setting, researchers have considered projection fusion, jointly training on the omnidirectional projection and the stereoscopic projection of the omnidirectional image, thereby alleviating the distortion problem in depth estimation.
Disclosure of Invention
The technical problem solved by the invention: a scene structure depth estimation method for indoor fisheye images is provided, aiming at the severe geometric distortion of existing fisheye images, the low accuracy of existing depth estimation models, and the blurred depth estimates at object edges.
The technical solution of the invention is as follows: a scene structure depth estimation method for indoor fisheye images, comprising the following steps:
(1) Constructing a scene structure depth estimation network model based on an encoding-decoding strategy and setting the training parameters of the network model; a fisheye distortion convolution module is adopted in the encoder, which uses deformable convolution to learn the geometric distortion information in the fisheye image and performs local convolution operations on the fisheye image, improving the accuracy of feature extraction from the fisheye image; an upward mapping layer module is adopted in the decoder to deepen the network structure; meanwhile, skip connections are added between the encoder and the decoder to improve the scene structure depth estimation accuracy of the network model; an image-feature-based target loss function is adopted in the network model training process;
(2) Training and optimizing the scene structure depth estimation network model on a training dataset of fisheye image depth;
(3) Feeding the test dataset of fisheye image depth into the trained scene structure depth estimation network model and predicting the scene structure depth of the input fisheye image.
Further, in step (1), the image-feature-based target loss function L is:

L = ω1·L_depth + ω2·L_grad + ω3·L_normal

L_depth is the depth loss term designed according to the distance characteristics of objects in the image, computed as:

L_depth = (1/N) · Σ_i ln(|d_i − g_i| + α)

where N is the number of samples (valid pixels), d_i denotes the predicted structure depth value, g_i the ground-truth structure depth value, α is an adjustable parameter empirically set to 0.5, and i is the pixel index.

L_grad is the gradient loss term designed according to the edge gradient structure characteristics of the image, computed as:

L_grad = (1/N) · Σ_i [ ln(|∇x(d_i − g_i)| + α) + ln(|∇y(d_i − g_i)| + α) ]

where ∇x and ∇y denote the edge gradient components, i.e. the partial derivatives of the depth error in the horizontal (x) and vertical (y) directions.

L_normal is the normal vector loss term designed according to the shape characteristics of objects in the image; it uses surface normal constraints to improve the estimation accuracy of the scene structure depth map in both global and local detail, and is computed as:

L_normal = (1/N) · Σ_i ( 1 − ⟨n_i^d, n_i^g⟩ / (‖n_i^d‖ · ‖n_i^g‖) )

where n_i^d and n_i^g denote the surface normal vectors computed from the predicted and ground-truth structure depth maps respectively, and ⟨·,·⟩ denotes the inner product of the predicted and ground-truth normal vectors.

ω1, ω2 and ω3 are the weight coefficients of the three loss terms. In the invention all three weight coefficients are set to 1, i.e. each term carries equal weight.
Further, in step (1), when constructing the scene structure depth estimation network model based on the encoding-decoding strategy, the encoder uses ResNet-50 as the backbone network to extract semantic features of the input fisheye image, learns the dependency relationships between image pixels, and outputs a feature map containing both low-dimensional and high-dimensional semantic information; the fisheye distortion convolution module is adopted in the third to fifth bottleneck blocks of ResNet-50, which enhances the ability of the scene structure depth estimation model to learn fisheye image distortion; the fisheye distortion convolution module is designed according to the fisheye image projection model.
Further, in step (1), when constructing the scene structure depth estimation network model based on the encoding-decoding strategy, the decoder is implemented as follows:
The feature map obtained by the encoder is taken as input, and feature decoding is built on the upward mapping layer module; the decoder comprises four upward mapping layer modules, which increase the resolution of the feature map and decode the semantic features, map the learned distributed feature representation to the sample label space through supervised end-to-end learning, and output the predicted structure depth map; each upward mapping layer module adopts a residual structure design.
Compared with the prior art, the invention has the advantages that:
(1) To address the severe geometric distortion of existing fisheye images, the blurred depth at object edges, and the poor depth estimation of existing scene structure depth estimation models, the invention collects the omnidirectional feature information of the fisheye image and weakens distortion interference through a fisheye distortion convolution module. In addition, an upward mapping layer module is designed based on the residual idea, and a structure depth estimation network with deeper layers and higher accuracy is built from it; meanwhile, a gradient-based target loss function is proposed and used to train the depth network end to end, achieving a better depth estimation effect on fisheye images.
(2) The design of the loss function plays an important role in the training of a neural network. Depth estimation is a regression task in machine learning, and the loss functions commonly used in regression tasks include the mean square error and the absolute value loss. The absolute value loss is not continuously differentiable, while the mean square loss is sensitive to outliers, which may hamper training. Therefore, the method guides the training of the network with an image-feature-based loss function, which better estimates depth information at object edges and improves the accuracy of the depth estimation network through end-to-end supervised learning.
(3) The sampling locations of a conventional convolution kernel are fixed: a convolution kernel can only sample at fixed positions of the input feature map, and its receptive field is always rectangular, so a convolutional neural network is poor at modeling geometric transformations. The fisheye images processed by the method contain objects of different shapes and sizes, and the nonlinear orthographic projection model of the fisheye camera introduces severe distortion into the image.
(4) Scene structure depth estimation for fisheye images has been little studied, and because of the large field of view and distortion of fisheye images, existing depth estimation networks perform poorly on fisheye datasets; a structure depth estimation network oriented to fisheye images is therefore designed. The whole network follows an encoder-decoder strategy: the fisheye image and the corresponding mask image are taken as network input, image features are extracted by the encoder, the dependency relationships between image pixels are learned, and a feature map containing both low-dimensional and high-dimensional semantic information is output. This feature map is taken as input to a decoder built on the upward mapping layer module, which decodes the features, maps the learned distributed feature representation to the sample label space through supervised end-to-end learning, and outputs the predicted depth map. To fully utilize feature maps of different scales, a skip connection structure is introduced between the encoder and the decoder, further improving the accuracy of depth estimation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is the overall flow diagram of the scene structure depth estimation for indoor fisheye images according to the present invention;
FIG. 2 is a structural diagram of the upward mapping layer module according to the present invention;
FIG. 3 is a diagram of the end-to-end neural network architecture of the present invention;
FIG. 4 is a schematic diagram of the network input and prediction according to the present invention: (a) is the RGB fisheye image input to the network, (b) is the corresponding mask image input to the network, and (c) is the depth map predicted by the network.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and based on the embodiments of the present invention, all other embodiments obtained by a person skilled in the art without creative efforts belong to the protection scope of the present invention.
As shown in fig. 1, the method of the present invention is implemented as follows:
1. Designing the feature-based target loss function
a) The target loss function is designed from loss terms corresponding to different image features and mainly comprises a depth loss term, a gradient loss term and a normal vector loss term.
b) The most common characteristic of an actual scene image is the difference in object distances. If this characteristic is ignored and the depth loss term L_depth is computed directly from equally weighted depth error values, the depth estimation result becomes blurred. The designed depth loss term is:

L_depth = (1/N) · Σ_i ln(|d_i − g_i| + α)

c) Apart from the distance differences, the most prominent features of an actual scene image are the structural features of multi-step edges. The depth loss term above only balances the depth transformation direction and has difficulty handling the different offsets occurring in the depth direction. To further optimize the depth information of different gradients at edges, the designed gradient loss term is:

L_grad = (1/N) · Σ_i [ ln(|∇x(d_i − g_i)| + α) + ln(|∇y(d_i − g_i)| + α) ]

d) A depth map, as one representation of the three-dimensional model of a scene, can be expressed as a finite number of smooth surfaces and the step edges between them. Although the gradient loss above can optimize different depths at edges, it has difficulty effectively handling shape features of the scene, such as the common steep edges, corners and planar structures. A normal vector encodes the angle information of an object surface: a uniform normal vector can impose a global constraint on planar features, while the angle information within planar features can effectively constrain local structural features. The designed normal vector loss term is:

L_normal = (1/N) · Σ_i ( 1 − ⟨n_i^d, n_i^g⟩ / (‖n_i^d‖ · ‖n_i^g‖) )

e) Combining the above loss terms, the target loss function for omnidirectional image structure depth estimation is defined as:

L = ω1·L_depth + ω2·L_grad + ω3·L_normal

The parameters of the depth estimation network are optimized with the constructed image-feature-based target loss function, yielding more accurate depth prediction results; an illustrative sketch of this loss is given below.
2. Designing the fisheye distortion convolution module
a) First, a regular grid R is defined for regular sampling of the input feature map f_l, where the grid R is determined by the convolution kernel size and dilation. Setting R = {(-1,-1), (-1,0), ..., (0,1), (1,1)} indicates that the receptive field is 3 × 3 and the dilation is 1. During traversal, the corresponding output feature map f_{l+1} is computed at each position through a weighted operation; the output mapping is:

f_{l+1}(p_0) = Σ_{p_n ∈ R} ω(p_n) · f_l(p_0 + p_n)

where f_{l+1}(p_0) denotes the point p_0 in the output feature map, obtained by summing the products of the sampled values in f_l and the weights ω; f_l(p_0 + p_n) denotes the pixel p_0 + p_n in the input feature map, and p_n is the n-th sampling point on the grid R spreading outward from the center f_l(p_0), where (-1,-1) of the grid R corresponds to the upper-left of f_l(p_0) and (1,1) to its lower-right. This computation shows that standard convolution sampling in a CNN is always fixed, the receptive field of each sampling point is an inherent rectangular structure, high-level semantic information is difficult to handle flexibly, and a convolutional neural network using standard convolution has a poor ability to model geometric variation.
b) On the basis of standard convolution, the invention extracts the effective area of the fisheye image through preprocessing to keep the context consistent, samples the corresponding irregular grid from the fisheye image, and computes the positions of the distorted pixels of the original grid according to the fisheye distortion projection model.
c) The invention introduces the designed fisheye distortion convolution module into the encoder of the network structure, as shown in fig. 3, to learn distortion information at different positions and of different degrees and to extract discriminative features.
The fisheye distortion convolution module is implemented as follows (an illustrative sketch is given after the list):
(1) When computing the fisheye distortion convolution, the effective area of the fisheye image is extracted through preprocessing to keep the context consistent, the corresponding irregular grid is sampled from the fisheye image, and the distorted pixel positions are computed from the original grid as:

f_{l+1}(p_0) = Σ_{p_n ∈ R} ω(p_n) · f_l(p_0 + p_n + Δp_n)

where Δp_n is the offset of p_n computed by the fisheye distortion projection model, and p_0 = (u(p_0), v(p_0)) denotes the pixel position in f_{l+1}.
(2) To compute the offsets, the longitude θ(p_0) and latitude φ(p_0) of p_0 in the spherical coordinate system are first computed from its pixel coordinates, where W is the image width and u and v denote the horizontal and vertical coordinates of the pixel position, respectively.
(3) The rotation matrix T is computed with the Euler-Rodrigues rotation formula as T = R_y(θ(p_0)) · R_x(φ(p_0)), where R_x(φ(p_0)) first rotates the convolution kernel counterclockwise about the x-axis by φ(p_0), and R_y(θ(p_0)) then rotates it counterclockwise about the y-axis by θ(p_0).
(4) Any point p_n on the convolution kernel is rotated by the matrix T as p'_n = T · p_n, where p_n = [i, j, d], i ∈ [−k_w/2, k_w/2], j ∈ [−k_h/2, k_h/2], and k_w and k_h are the width and height of the convolution kernel.
(5) d is the distance from the kernel grid R to the center of the unit sphere and is computed from the field of view and the size of the convolution kernel.
(6) The three-dimensional coordinates of the rotated convolution kernel points are then mapped to the corresponding longitude and latitude coordinates θ(p'_n) and φ(p'_n).
(7) The transformed longitude and latitude coordinates are projected to the corresponding pixel coordinates (u(p'_n), v(p'_n)) in the fisheye image.
(8) The offset Δp'_n is computed as:

u(Δp'_n) = u(p'_n) − u(p_n)

v(Δp'_n) = v(p'_n) − v(p_n)
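The following sketch illustrates how such per-position offsets can be precomputed and applied through a deformable convolution. It is only an illustration under explicit assumptions: since the publication gives the explicit projection formulas as figures, an equidistant fisheye model (r = f·ψ) and a tangent-plane kernel at unit distance (d = 1) are assumed here, PyTorch with torchvision's deform_conv2d is used as the deformable-convolution backend, and the names fisheye_kernel_offsets and FisheyeDistortionConv2d are illustrative rather than taken from the patent.

```python
import math
import torch
from torchvision.ops import deform_conv2d


def fisheye_kernel_offsets(height, width, k=3, fov=math.pi):
    """Precompute offsets of shape (2*k*k, H, W) for one feature-map size.

    Sketch only: assumes an equidistant fisheye model (r = f * psi) and a
    kernel placed on the tangent plane at unit distance from the sphere center.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    f = min(cx, cy) / (fov / 2.0)                        # equidistant focal length
    rng = torch.arange(k, dtype=torch.float32) - (k - 1) / 2.0
    jj, ii = torch.meshgrid(rng, rng, indexing="ij")     # regular kernel grid R, row-major
    step = fov / max(height, width)                      # assumed angular spacing per kernel cell
    base = torch.stack([ii * step, jj * step, torch.ones_like(ii)], dim=-1).reshape(-1, 3)

    offsets = torch.zeros(2 * k * k, height, width)
    for v in range(height):
        for u in range(width):
            # Step (2): longitude/latitude of p0 under the assumed fisheye model.
            du, dv = u - cx, v - cy
            psi = min(math.hypot(du, dv) / f, fov / 2.0)
            az = math.atan2(dv, du)
            x, y, z = math.sin(psi) * math.cos(az), math.sin(psi) * math.sin(az), math.cos(psi)
            lon, lat = math.atan2(x, z), math.asin(max(-1.0, min(1.0, y)))
            # Step (3): T = R_y(lon) * R_x(lat); the sign of the x-rotation is chosen
            # so the rotated kernel centre lands on this pixel's viewing ray.
            a = -lat
            rx = torch.tensor([[1.0, 0.0, 0.0],
                               [0.0, math.cos(a), -math.sin(a)],
                               [0.0, math.sin(a), math.cos(a)]])
            ry = torch.tensor([[math.cos(lon), 0.0, math.sin(lon)],
                               [0.0, 1.0, 0.0],
                               [-math.sin(lon), 0.0, math.cos(lon)]])
            t = ry @ rx
            # Step (4): rotate all kernel points; steps (6)-(7): project them back
            # to longitude/latitude and then to fisheye pixel coordinates.
            rot = base @ t.T
            rot = rot / rot.norm(dim=1, keepdim=True)
            psi_n = torch.acos(rot[:, 2].clamp(-1.0, 1.0))
            az_n = torch.atan2(rot[:, 1], rot[:, 0])
            un = cx + f * psi_n * torch.cos(az_n)
            vn = cy + f * psi_n * torch.sin(az_n)
            # Step (8): offset = distorted position minus regular grid position,
            # stored as (dy, dx) pairs as expected by deform_conv2d.
            off = torch.stack([vn - (v + jj.reshape(-1)), un - (u + ii.reshape(-1))], dim=1)
            offsets[:, v, u] = off.reshape(-1)
    return offsets


class FisheyeDistortionConv2d(torch.nn.Module):
    """Deformable convolution driven by precomputed fisheye distortion offsets."""

    def __init__(self, in_ch, out_ch, height, width, k=3, fov=math.pi):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.bias = torch.nn.Parameter(torch.zeros(out_ch))
        self.padding = k // 2
        self.register_buffer("offsets", fisheye_kernel_offsets(height, width, k, fov))

    def forward(self, x):
        off = self.offsets.unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
        return deform_conv2d(x, off, self.weight, self.bias, padding=self.padding)
```

In use, one such module (with offsets precomputed for the corresponding feature-map resolution) would stand in for a standard 3 × 3 convolution in the encoder's later bottleneck blocks.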
3. Scene structure depth estimation network model based on the encoding-decoding strategy
(1) An effective fisheye image structure depth estimation network is designed with an encoding-decoding strategy; the overall network structure is shown in fig. 3.
(2) The input of the whole network structure consists of two parts: the fisheye image and the corresponding mask image are taken as network input, shown as (a) and (b) in fig. 4, respectively. In the mask map, all pixel values belonging to movable objects are set to 0 and appear black, while the pixel values of the other structural regions are set to 255 and appear white, as a bitmap. When the mask map is added to guide structure depth estimation, two different methods are adopted: the mask map is either multiplied element-wise with the RGB image and fed into the encoding structure, or directly connected to the decoding structure. Through the designed encoding-decoding structure, the scene structure depth map with movable objects removed is estimated.
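A minimal sketch of the first option (the mask multiplied element-wise with the RGB image before encoding); the 0/255 value convention follows the description, while the tensor layout and function name are assumptions of this example:

```python
import torch

def mask_guided_input(rgb, mask):
    """rgb: (B, 3, H, W) fisheye image; mask: (B, 1, H, W), 0 = movable object, 255 = structure."""
    m = (mask > 0).float()      # 1 for structural regions, 0 for movable objects
    return rgb * m              # movable-object pixels are zeroed before entering the encoder
```

The second option would instead feed m (suitably resized) into the decoding structure.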
(3) The backbone of the encoder is still ResNet-50 with the fully connected layer removed; the fisheye distortion convolution module is introduced into the final convolution layers of ResNet-50, which alleviates the inefficient feature learning caused by omnidirectional geometric distortion, learns the dependency relationships between image pixels, outputs a feature map containing both low-dimensional and high-dimensional semantic information, and improves the geometric-transformation modeling capability of the structure depth estimation.
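A hedged sketch of such an encoder: torchvision's ResNet-50 with the fully connected layer dropped, returning the intermediate feature maps later needed for the skip connections. Where exactly the fisheye distortion convolution replaces the standard convolutions is indicated only by a comment, since this sketch does not fix those details.

```python
import torch
import torchvision


class FisheyeEncoder(torch.nn.Module):
    """ResNet-50 feature extractor with the fully connected layer removed."""

    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2 = r.layer1, r.layer2
        # In the described model, standard convolutions in the later bottleneck
        # blocks would be replaced by the fisheye distortion convolution module.
        self.layer3, self.layer4 = r.layer3, r.layer4

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)       # 1/4 resolution, low-level semantic information
        f2 = self.layer2(f1)      # 1/8
        f3 = self.layer3(f2)      # 1/16
        f4 = self.layer4(f3)      # 1/32, high-level semantic information
        return f1, f2, f3, f4     # multi-scale features for the skip connections
```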
(4) The decoder consists of four upward mapping modules and a 3 × 3 convolution layer; its main purpose is to restore the feature map resolution to the original image size and to decode the semantic features obtained during encoding, where bilinear interpolation is used for upsampling to increase the feature map resolution. On this basis, an upward mapping layer module based on a residual structure is designed, as shown in fig. 2, which further increases the depth of the network structure, avoids the problems of vanishing gradients and overfitting, and improves the learning capability of the model.
As shown in fig. 2, the upward mapping layer module designed based on the residual structure comprises the following (a minimal implementation sketch is given after this list):
(a) The resolution of the module's input feature map is H × W, where H is the height and W the width of the feature map;
(b) Upsampling is performed by bilinear interpolation, which expands the feature map resolution to twice the original and avoids the aliasing (sawtooth) artifacts caused by discontinuous gray values in the enlarged feature map;
(c) The upper and lower branches each apply a 5 × 5 convolution layer for smoothing;
(d) The upper branch then applies a 3 × 3 convolution layer to process the feature information; the output feature maps of the two branches are added, and the result is activated by a ReLU and serves as the output of the upward mapping layer module.
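A minimal PyTorch sketch of this module follows. The padding values (2 for the 5 × 5 convolutions, 1 for the 3 × 3 convolution) are chosen so the resolution is preserved, and the channel handling is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F


class UpMapping(torch.nn.Module):
    """Residual upward mapping layer: x2 bilinear upsampling plus two convolution branches."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.upper_5x5 = torch.nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.upper_3x3 = torch.nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.lower_5x5 = torch.nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        # (b) bilinear upsampling doubles the feature-map resolution (H x W -> 2H x 2W)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        upper = self.upper_3x3(self.upper_5x5(x))   # (c)+(d) upper branch: 5x5 then 3x3
        lower = self.lower_5x5(x)                   # (c) lower branch: 5x5 smoothing
        return F.relu(upper + lower)                # (d) add the two branches, then ReLU
```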
(5) In order to fully utilize the omnidirectional semantic information in feature maps of different scales, the multi-scale features of the encoder and decoder are fused through skip connections, further improving the accuracy of the network's structure depth estimation; the learned distributed feature representation is mapped to the sample label space through supervised end-to-end learning, and the predicted depth map is output, as shown in fig. 4(c).
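Putting the pieces together, a hedged sketch of the decoder path reusing the UpMapping module above: four upward mapping blocks restore the resolution, encoder features are fused through skip connections, and a final 3 × 3 convolution outputs the depth map. How the skips are fused is not fixed by the description; element-wise addition after 1 × 1 projection convolutions, and a final bilinear upsampling to the input size, are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F


class FisheyeDepthDecoder(torch.nn.Module):
    """Decoder: four residual upward mapping blocks with encoder skip connections."""

    def __init__(self, enc_ch=(256, 512, 1024, 2048), dec_ch=(1024, 512, 256, 128)):
        super().__init__()
        self.ups = torch.nn.ModuleList([
            UpMapping(enc_ch[3], dec_ch[0]),
            UpMapping(dec_ch[0], dec_ch[1]),
            UpMapping(dec_ch[1], dec_ch[2]),
            UpMapping(dec_ch[2], dec_ch[3]),
        ])
        # 1x1 projections so encoder skip features match decoder channel widths.
        self.skips = torch.nn.ModuleList([
            torch.nn.Conv2d(enc_ch[2], dec_ch[0], 1),
            torch.nn.Conv2d(enc_ch[1], dec_ch[1], 1),
            torch.nn.Conv2d(enc_ch[0], dec_ch[2], 1),
        ])
        self.head = torch.nn.Conv2d(dec_ch[3], 1, kernel_size=3, padding=1)

    def forward(self, f1, f2, f3, f4):
        x = self.ups[0](f4) + self.skips[0](f3)     # 1/16 resolution
        x = self.ups[1](x) + self.skips[1](f2)      # 1/8
        x = self.ups[2](x) + self.skips[2](f1)      # 1/4
        x = self.ups[3](x)                          # 1/2
        depth = self.head(x)                        # predicted structure depth map
        return F.interpolate(depth, scale_factor=2, mode="bilinear", align_corners=False)
```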

Claims (4)

1. A scene structure depth estimation method for indoor fisheye images, characterized by comprising the following steps:
(1) Constructing a scene structure depth estimation network model based on an encoding-decoding strategy and setting the training parameters of the network model; a fisheye distortion convolution module is adopted in the encoder, which uses deformable convolution to learn the geometric distortion information in the fisheye image and performs local geometric-feature convolution operations on the fisheye image, improving the accuracy of feature extraction from the fisheye image; an upward mapping layer module is adopted in the decoder to deepen the network structure; meanwhile, skip connections are added between the encoder and the decoder to improve the scene structure depth estimation accuracy of the network model; an image-feature-based target loss function is adopted in the network model training process;
(2) Training and optimizing the scene structure depth estimation network model on a training dataset of fisheye image depth;
(3) Feeding the test dataset of fisheye image depth into the trained scene structure depth estimation network model and predicting the scene structure depth of the input fisheye image.
2. The scene structure depth estimation method for indoor fisheye images according to claim 1, characterized in that: in step (1), the image-feature-based target loss function L is:

L = ω1·L_depth + ω2·L_grad + ω3·L_normal

L_depth is the depth loss term, computed as:

L_depth = (1/N) · Σ_i ln(|d_i − g_i| + α)

where N is the number of samples, d_i denotes the predicted structure depth value, g_i the ground-truth structure depth value, α is an adjustable parameter, and i is the pixel index;

L_grad is the gradient loss term, computed as:

L_grad = (1/N) · Σ_i [ ln(|∇x(d_i − g_i)| + α) + ln(|∇y(d_i − g_i)| + α) ]

where ∇x and ∇y denote the edge gradient components, i.e. the partial derivatives of the depth error in the x (horizontal) and y (vertical) directions;

L_normal is the normal vector loss term, computed as:

L_normal = (1/N) · Σ_i ( 1 − ⟨n_i^d, n_i^g⟩ / (‖n_i^d‖ · ‖n_i^g‖) )

where n_i^d and n_i^g denote the normal vectors computed from the predicted and ground-truth structure depth maps respectively, and ⟨·,·⟩ denotes the inner product of the predicted and ground-truth normal vectors; ω1, ω2 and ω3 are the weight coefficients of the three loss terms.
3. The scene structure depth estimation method for indoor fisheye images according to claim 1, characterized in that: in step (1), when constructing the scene structure depth estimation network model based on the encoding-decoding strategy, the encoder uses ResNet-50 as the backbone network to extract semantic features of the input fisheye image, learns the dependency relationships between image pixels, and outputs a feature map containing both low-dimensional and high-dimensional semantic information; the fisheye distortion convolution module is adopted in the third to fifth bottleneck blocks of ResNet-50, which enhances the ability of the scene structure depth estimation model to learn fisheye image distortion; the fisheye distortion convolution module is designed according to the fisheye image projection model.
4. The scene structure depth estimation method for indoor fisheye images according to claim 1, characterized in that: in step (1), when constructing the scene structure depth estimation network model based on the encoding-decoding strategy, the decoder is implemented as follows:
the feature map obtained by the encoder is taken as input, and feature decoding is built on the upward mapping layer module; the decoder comprises four upward mapping layer modules, which increase the resolution of the feature map and decode the semantic features, map the learned distributed feature representation to the sample label space through supervised end-to-end learning, and output the predicted structure depth map; each upward mapping layer module adopts a residual structure design.
CN202211397138.0A 2022-11-09 2022-11-09 Scene structure depth estimation method for indoor fisheye image Pending CN115546273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211397138.0A CN115546273A (en) 2022-11-09 2022-11-09 Scene structure depth estimation method for indoor fisheye image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211397138.0A CN115546273A (en) 2022-11-09 2022-11-09 Scene structure depth estimation method for indoor fisheye image

Publications (1)

Publication Number Publication Date
CN115546273A true CN115546273A (en) 2022-12-30

Family

ID=84719786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211397138.0A Pending CN115546273A (en) 2022-11-09 2022-11-09 Scene structure depth estimation method for indoor fisheye image

Country Status (1)

Country Link
CN (1) CN115546273A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053441A (en) * 2020-10-14 2020-12-08 北京大视景科技有限公司 Full-automatic layout recovery method for indoor fisheye image
CN115063463A (en) * 2022-06-20 2022-09-16 东南大学 Fish-eye camera scene depth estimation method based on unsupervised learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
IRO LAINA ET AL.: "Deeper Depth Prediction with Fully Convolutional Residual Networks", arXiv, 19 September 2016 (2016-09-19), pages 1-12, XP055759673, DOI: 10.1109/3DV.2016.32 *
JUNJIE HU ET AL.: "Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps with Accurate Object Boundaries", arXiv, 22 September 2018 (2018-09-22), pages 1-9 *
MING MENG ET AL.: "Distortion-Aware Room Layout Estimation from A Single Fisheye Image", 2021 IEEE International Symposium on Mixed and Augmented Reality, pages 441-449 *
JIANG ZHONGZE: "Research on pixel-level scene understanding technology based on fully convolutional networks", China Master's Theses Full-text Database, Information Science and Technology, no. 05, pages 22-26 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152491A (en) * 2023-01-03 2023-05-23 北京海天瑞声科技股份有限公司 Semantic segmentation method, semantic segmentation device and storage medium
CN116152491B (en) * 2023-01-03 2023-12-26 北京海天瑞声科技股份有限公司 Semantic segmentation method, semantic segmentation device and storage medium
CN117036613A (en) * 2023-08-18 2023-11-10 武汉大学 Polarization three-dimensional reconstruction method and system based on multiple receptive field blending network
CN117036613B (en) * 2023-08-18 2024-04-02 武汉大学 Polarization three-dimensional reconstruction method and system based on multiple receptive field blending network

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination