CN113192149B - Image depth information monocular estimation method, apparatus and readable storage medium

Publication number: CN113192149B (application CN202110554113.6A)
Authority: CN (China)
Prior art keywords: conv, semantic feature, feature map, image, depth
Legal status: Active
Application number: CN202110554113.6A
Other languages: Chinese (zh)
Other versions: CN113192149A
Inventors: Wang Fei (王飞), Xu Qiang (许强), Guo Yu (郭宇), Zhang Qiuguang (张秋光), Zhang Xuetao (张雪涛)
Assignee: Xi'an Jiaotong University

Application filed by Xi'an Jiaotong University
Priority to CN202110554113.6A
Publication of application CN113192149A
Application granted; publication of grant CN113192149B

Classifications

    • G06T 9/002 - Image coding using neural networks
    • G06N 3/045 - Neural network architectures: combinations of networks
    • G06N 3/048 - Neural network architectures: activation functions
    • G06N 3/08 - Neural networks: learning methods
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a device and a storage medium for monocular estimation of image depth information. An image to be estimated is taken as the input of a pre-trained self-supervised channel mixing network. An encoder module encodes the image to be estimated into several semantic feature maps of different levels, which differ both in semantic level and in resolution. A channel mixing module mixes and redistributes these semantic feature maps along the channel direction to obtain fused features at different resolutions. A decoder module decodes the fused features at each resolution into a depth estimate of the corresponding resolution, yielding the depth image of the image to be estimated. Because the decoder module decodes fused features, the resulting depth image carries more reliable local and global information; compared with an existing baseline method without the channel mixing module, the depth estimation quality of the invention is greatly improved.

Description

Image depth information monocular estimation method, apparatus and readable storage medium
Technical Field
The invention belongs to the technical field of 3D computer vision, and in particular relates to a method, a device and a readable storage medium for monocular estimation of image depth information.
Background
Depth estimation is a very important problem in computer vision and is widely used in fields such as autonomous driving and virtual reality. Methods based on various sensor configurations have been proposed, including monocular cameras, multi-view cameras, and radar depth sensors. Among them, depth estimation from a monocular camera is the simplest to configure, but because of the scale ambiguity inherent to the monocular setting it is also the most difficult. At present the best-performing depth estimation methods are supervised deep learning methods, which rely on a large amount of data with ground-truth depth labels; however, accurate depth ground truth is expensive to acquire, and a model trained on data from a specific scene is difficult to adapt to different scenes, so such methods are hard to apply widely. Self-supervised monocular depth estimation based on image pairs or videos has recently made great progress: it needs no labeled data for training, and all supervision comes from image texture information and geometric constraints, so large unlabeled data sets can be used for training and the approach adapts well to different scenes.
Specifically, a self-supervised monocular depth estimation method needs only a single image at test time; at training time it falls into one of two settings: monocular video sequences and stereo image pairs. The core idea of both is to establish pixel correspondences between different viewpoints through the estimated depth map. Training methods based on monocular video sequences must estimate the depth map and the camera motion simultaneously. Methods based on stereo image pairs only need to estimate the depth map, because the relative pose between the binocular cameras is calibrated in advance, and they generally perform better than methods based on video sequences.
With the rapid development of deep learning, self-supervised monocular depth estimation based on neural networks has improved greatly over traditional methods. Considering training on stereo image pairs: Poggi et al. proposed learning from image pairs captured by a trinocular (three-camera) rig, where the depth estimate of the middle image is constrained geometrically by both the left and right views; Tosi et al. proposed assisting the training of the network with traditional methods, for example SGM estimates filtered by left-right view consistency constraints; Zhu et al. proposed guiding the optimization of depth map contours with semantic segmentation results; Gonzalez et al. adopted a mirrored occlusion module to estimate occluded regions, effectively removing the interference of occlusion in network training. Most of these methods use a network such as ResNet as an encoder to extract multi-scale, multi-level features of the image and then use a decoder to derive depth estimates from these features. When fusing features of different levels they only perform simple addition or channel-wise concatenation, without fully exploiting the advantages of, and the complementarity between, features of different levels, so network performance is not improved further.
To solve the problem that features in different groups of a group convolution cannot communicate, the channel shuffle operation was proposed, which recombines the grouped features along the channel direction. To handle keypoint detection in difficult scenarios such as occlusion in human pose estimation, Su et al. used channel shuffle to merge different features, strengthening the communication between features of each level and improving detection accuracy.
Currently, in the field of depth estimation, no method has tried to explore how to fuse features of different levels more effectively within a depth estimation network and thereby enhance the expressive power of the features.
Disclosure of Invention
Aiming at the above technical problems in the prior art, the invention provides a method, a device and a readable storage medium for monocular estimation of image depth information. They address the problem that existing depth estimation methods cannot fully fuse features of different levels or exploit the complementary advantages between them, which strongly limits the performance of the depth estimation network and lowers the accuracy of the estimated depth information.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the invention provides an image depth information monocular estimation method, which comprises the following steps:
Taking an image to be estimated as input of a pre-trained self-supervision channel hybrid network; the self-supervision channel mixing network comprises an encoder module, a channel mixing module CSM and a decoder module;
encoding the image to be estimated by using an encoder module to obtain a plurality of semantic feature images with different layers; the semantic layers of the semantic feature graphs are different, and the resolutions are different;
mixing and dispersing a plurality of semantic feature graphs with different layers in the channel direction by using a channel mixing module CSM to obtain fusion features with different resolutions;
and respectively decoding the fusion features with different resolutions by using a decoder module to obtain depth estimation with corresponding resolution, and obtaining a depth image of the image to be estimated.
Further, the encoder module adopts a multi-scale feature encoder G_Enc; the multi-scale feature encoder G_Enc is a ResNet-based encoder comprising a convolutional layer conv1 and encoder layers layer1 to layer4;
the input of the multi-scale feature encoder G_Enc is an RGB image, and the outputs are the semantic feature maps R-Conv-1, R-Conv-2, R-Conv-3, R-Conv-4 and R-Conv-5;
the resolutions of the semantic feature maps R-Conv-1 to R-Conv-5 decrease in turn.
Further, the process of mixing and redistributing the semantic feature maps of different levels along the channel direction with the channel mixing module CSM to obtain fused features at different resolutions is specifically as follows:
performing a convolution operation on each of the semantic feature maps R-Conv-1 to R-Conv-5 to obtain the semantic feature maps Conv-1, Conv-2, Conv-3, Conv-4 and Conv-5; the numbers of channels of the semantic feature maps Conv-1 to Conv-5 are the same;
upsampling the semantic feature maps Conv-2 to Conv-5 so that their resolutions after upsampling equal the resolution of the semantic feature map Conv-1;
merging the semantic feature map Conv-1 with the upsampled semantic feature maps Conv-2 to Conv-5 and then performing a channel mixing operation to obtain mixed semantic features;
splitting the mixed semantic features evenly along the channel direction to obtain five levels of semantic features with the same number of channels;
keeping the resolution of the first-level semantic features unchanged and recording them as the semantic feature map C-Conv-1;
downsampling the second- to fifth-level semantic features so that their resolutions after downsampling equal those of the semantic feature maps Conv-2 to Conv-5 respectively, obtaining the semantic feature maps C-Conv-2, C-Conv-3, C-Conv-4 and C-Conv-5;
performing a convolution operation on the semantic feature maps C-Conv-1 to C-Conv-5 to obtain the semantic feature maps S-Conv-1, S-Conv-2, S-Conv-3, S-Conv-4 and S-Conv-5;
merging the semantic feature maps Conv-1 to Conv-5 with the semantic feature maps S-Conv-1 to S-Conv-5 in one-to-one correspondence to obtain the fused features at different resolutions.
Further, the merging operation adopts the concat function; the splitting operation adopts the split function; the upsampling and downsampling operations use nearest-neighbor interpolation.
Further, the decoder module adopts a deep neural network decoder, which comprises a convolution block, an upsampling layer, a merging layer, a convolution layer and an output layer;
the input of the convolution block is the fused features at different resolutions; the output of the convolution block is connected to the input of the upsampling layer, the output of the upsampling layer to the input of the merging layer, the output of the merging layer to the input of the convolution layer, and the output of the convolution layer to the input of the output layer.
Furthermore, the output layer adopts a sigmoid function; the output value of the sigmoid function is nonlinearly transformed to obtain the depth estimate at the corresponding resolution;
Wherein, the nonlinear transformation formula is as follows:
d=1/(a*y+b)
where d is the depth estimate at the corresponding resolution; a is a linear transformation coefficient calculated from the maximum depth; y is the output value of the sigmoid function; and b is a linear transformation coefficient calculated from the minimum depth.
Further, the training process of the pre-trained self-supervised channel mixing network is specifically as follows:
constructing an image training set comprising stereo image pairs, wherein the image to be estimated in a stereo image pair is recorded as view T and the other image as view S; applying the same random downsampling and cropping to view T and view S respectively to obtain the cropped view T and the cropped view S;
performing rough depth estimation on the cropped view T to obtain a rough depth map corresponding to view T; filtering with the consistency constraint between the rough depth map corresponding to view T and view S to obtain a filtered depth map; using the filtered depth map as a pseudo-label for training;
taking view T as the input of the self-supervised channel mixing network and outputting the depth map of view T;
converting the depth map of view T into the point cloud of view T, and obtaining for each pixel in the point cloud of view T the corresponding pixel position in view S;
warping the colors of the corresponding pixels in view S back to view T by bilinear interpolation to obtain the generated image T' of view T;
computing errors between view T and the generated image T', and between the estimated depth and the training pseudo-labels, to obtain the trained self-supervised channel mixing network.
Further, the error function between view T and the generated image T' is the L1 function; the error function between the estimated depth and the training pseudo-label is the SSIM function.
The invention also provides an image depth information monocular estimation device, comprising a memory, a processor, and executable instructions stored in the memory and runnable on the processor; when the processor executes the executable instructions, the image depth information monocular estimation method is implemented.
The invention also provides a computer readable storage medium having stored thereon computer executable instructions which, when executed by a processor, implement the image depth information monocular estimation method.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a method, a device and a readable storage medium for monocular estimation of image depth information. The encoder module produces semantic feature maps of the image to be estimated at different levels; the channel mixing module CSM mixes the semantic features of different levels along the channel direction, so that the feature maps of different levels share their respective useful information; the decoder module decodes the fused features to obtain the depth image of the image to be estimated, which carries more reliable local and global information. Compared with an existing baseline method without the channel mixing module CSM, the depth estimation quality of the invention is greatly improved.
Furthermore, the multi-scale feature encoder G_Enc is a ResNet-based encoder, which has few network parameters, trains quickly, and has good feature extraction ability.
Furthermore, the channel mixing module mixes and redistributes the features of different levels along the channel direction, so that the new features fuse information from features of all levels. The fused features contain the deep semantic information of the low-resolution features, whose large receptive fields help the decoder reason about depth in weakly textured and occluded regions and about the relations between objects; at the same time they contain the local detail information of the high-resolution features, which effectively improves the accuracy of the decoder's per-pixel depth estimates in ordinary regions.
Furthermore, a deep neural network decoder is adopted to decode the fused features at different resolutions, so that a corresponding depth map is decoded at each resolution of the input features. During decoding, the depth map at each resolution draws not only on the encoder features of the current resolution but also on the upsampled features of the previous resolution; the former ensures that the original-resolution feature information extracted by the encoder is not destroyed, and the latter lets information from the other feature levels permeate the features of the current level. The depth map decoded at each resolution is supervised during training, which helps the network cope effectively with scale changes in the scene.
Furthermore, the self-supervised channel mixing network is trained on stereo image pairs, and the relative pose between the left and right cameras is calibrated in advance. The network therefore only needs to estimate the depth map, and the known camera parameters are used directly to compute the pixel correspondences between the left and right images, which greatly reduces the learning difficulty of the network and improves the accuracy of the estimated depth map. Training on stereo image pairs also ensures that the estimated depth has a definite scale consistent with the ground-truth depth map.
Drawings
FIG. 1 is a network architecture diagram of the self-supervised channel mixing network in the embodiment;
FIG. 2 is a structure diagram of the channel mixing module in the embodiment;
FIG. 3 is a structure diagram of the channel mixing layer in the embodiment;
FIG. 4 is an original image of depth to be estimated in an embodiment;
FIG. 5 is a depth image estimated by a prior art reference method;
fig. 6 is a depth image estimated by the embodiment method.
Detailed Description
In order to make the technical problems, technical solutions and beneficial effects addressed by the invention clearer, the invention is described in further detail below through specific embodiments. It should be understood that the specific embodiments described here are for illustration only and are not intended to limit the scope of the invention.
The invention provides a monocular estimation method of image depth information, which takes the image to be estimated as the input of a pre-trained self-supervised channel mixing network; the self-supervised channel mixing network comprises an encoder module, a channel mixing module CSM and a decoder module;
The specific process is as follows:
Step 1, encoding the image to be estimated with the encoder module to obtain several semantic feature maps of different levels; the semantic feature maps differ both in semantic level and in resolution.
The encoder module adopts a multi-scale feature encoder G_Enc, which is a ResNet-based encoder comprising a convolutional layer conv1 and encoder layers layer1 to layer4.
The input of the multi-scale feature encoder G_Enc is an RGB image, and the outputs are the semantic feature maps R-Conv-1 to R-Conv-5, whose resolutions decrease in turn.
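As an illustration of this structure, a minimal sketch of the five-level feature extraction follows (PyTorch, assuming a torchvision ResNet-18 backbone; the patent does not fix the ResNet depth, so the exact channel counts are illustrative):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiScaleEncoder(nn.Module):
    # Sketch of the multi-scale feature encoder G_Enc (ResNet-18 assumed).
    def __init__(self):
        super().__init__()
        r = models.resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu)  # conv1 stage, 1/2 resolution
        self.layer1 = nn.Sequential(r.maxpool, r.layer1)   # encoder layer1, 1/4
        self.layer2, self.layer3, self.layer4 = r.layer2, r.layer3, r.layer4  # 1/8, 1/16, 1/32

    def forward(self, x):              # x: RGB image, shape (B, 3, H, W)
        f1 = self.stem(x)              # R-Conv-1
        f2 = self.layer1(f1)           # R-Conv-2
        f3 = self.layer2(f2)           # R-Conv-3
        f4 = self.layer3(f3)           # R-Conv-4
        f5 = self.layer4(f4)           # R-Conv-5
        return [f1, f2, f3, f4, f5]    # resolutions decrease in turn

feats = MultiScaleEncoder()(torch.randn(1, 3, 192, 640))
print([tuple(f.shape) for f in feats])
```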
Step 2, mixing and redistributing the semantic feature maps of different levels along the channel direction with the channel mixing module CSM to obtain fused features at different resolutions.
The specific process is as follows:
Step 21, performing a convolution operation on each of the semantic feature maps R-Conv-1 to R-Conv-5 to obtain the semantic feature maps Conv-1, Conv-2, Conv-3, Conv-4 and Conv-5; the numbers of channels of the semantic feature maps Conv-1 to Conv-5 are the same.
Step 22, upsampling the semantic feature maps Conv-2 to Conv-5 so that their resolutions after upsampling equal the resolution of the semantic feature map Conv-1.
Step 23, merging the semantic feature map Conv-1 with the upsampled semantic feature maps Conv-2 to Conv-5 and then performing a channel mixing operation to obtain mixed semantic features.
Step 24, splitting the mixed semantic features evenly along the channel direction to obtain five levels of semantic features with the same number of channels.
Step 25, keeping the resolution of the first-level semantic features unchanged and recording them as the semantic feature map C-Conv-1; downsampling the second- to fifth-level semantic features so that their resolutions after downsampling equal those of the semantic feature maps Conv-2 to Conv-5 respectively, obtaining the semantic feature maps C-Conv-2, C-Conv-3, C-Conv-4 and C-Conv-5.
Step 26, performing a convolution operation on each of the semantic feature maps C-Conv-1 to C-Conv-5 to obtain the semantic feature maps S-Conv-1 to S-Conv-5.
Step 27, merging the semantic feature maps Conv-1 to Conv-5 with the semantic feature maps S-Conv-1 to S-Conv-5 in one-to-one correspondence to obtain the fused features at different resolutions.
Step 3, decoding the fused features at each resolution with the decoder module to obtain depth estimates of the corresponding resolutions, yielding the depth image of the image to be estimated.
The decoder module adopts a deep neural network decoder, which comprises a convolution block, an upsampling layer, a merging layer, a convolution layer and an output layer; the input of the convolution block is the fused features at different resolutions; the output of the convolution block is connected to the input of the upsampling layer, the output of the upsampling layer to the input of the merging layer, the output of the merging layer to the input of the convolution layer, and the output of the convolution layer to the input of the output layer.
In the invention, the output layer adopts a sigmoid function, and the output value of the sigmoid function is nonlinearly transformed to obtain the depth estimate at the corresponding resolution;
Wherein, the nonlinear transformation formula is as follows:
d=1/(a*y+b)
where d is the depth estimate at the corresponding resolution; a is a linear transformation coefficient calculated from the maximum depth; y is the output value of the sigmoid function; and b is a linear transformation coefficient calculated from the minimum depth.
In the invention, the training process of the pre-trained self-supervised channel mixing network is specifically as follows:
constructing an image training set comprising stereo image pairs, wherein the image to be estimated in a stereo image pair is recorded as view T and the other image as view S; applying the same random downsampling and cropping to view T and view S respectively to obtain the cropped view T and the cropped view S;
performing rough depth estimation on the cropped view T to obtain a rough depth map corresponding to view T; filtering with the consistency constraint between the rough depth map corresponding to view T and view S to obtain a filtered depth map; using the filtered depth map as a pseudo-label for training;
taking view T as the input of the self-supervised channel mixing network and outputting the depth map of view T;
converting the depth map of view T into the point cloud of view T, and obtaining for each pixel in the point cloud of view T the corresponding pixel position in view S;
warping the colors of the corresponding pixels in view S back to view T by bilinear interpolation to obtain the generated image T' of view T;
computing errors between view T and the generated image T', and between the estimated depth and the training pseudo-labels, to obtain the trained self-supervised channel mixing network; the error function between view T and the generated image T' is the L1 function, and the error function between the estimated depth and the training pseudo-label is the SSIM function.
With the image depth information monocular estimation method, a single test image is input into the trained network model and the corresponding depth of every pixel of the image is obtained automatically; comparing the obtained depth map with the ground truth shows that very high estimation accuracy is achieved.
The invention also provides an image depth information monocular estimation system, comprising an encoder module, a channel mixing module CSM and a decoder module;
the encoder module is used to encode the image to be estimated to obtain several semantic feature maps of different levels, which differ both in semantic level and in resolution;
the channel mixing module CSM is used to mix and redistribute the semantic feature maps of different levels along the channel direction to obtain fused features at different resolutions;
the decoder module is used to decode the fused features at each resolution to obtain depth estimates of the corresponding resolutions, yielding the depth image of the image to be estimated.
The present invention also provides an image depth information monocular estimation apparatus, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, such as an image depth information monocular estimation program.
When executing the computer program, the processor performs the steps of the image depth information monocular estimation method, for example: taking an image to be estimated as the input of a pre-trained self-supervised channel mixing network, the network comprising an encoder module, a channel mixing module CSM and a decoder module; encoding the image to be estimated with the encoder module to obtain several semantic feature maps of different levels, which differ both in semantic level and in resolution; mixing and redistributing the semantic feature maps of different levels along the channel direction with the channel mixing module CSM to obtain fused features at different resolutions; and decoding the fused features at each resolution with the decoder module to obtain depth estimates of the corresponding resolutions, yielding the depth image of the image to be estimated.
Alternatively, when executing the computer program, the processor implements the functions of the modules/units of the image depth information monocular estimation system: the encoder module encodes the image to be estimated into several semantic feature maps of different levels, which differ both in semantic level and in resolution; the channel mixing module CSM mixes and redistributes the semantic feature maps of different levels along the channel direction to obtain fused features at different resolutions; and the decoder module decodes the fused features at each resolution to obtain depth estimates of the corresponding resolutions, yielding the depth image of the image to be estimated.
The computer program may be divided into one or more units, which are stored in the memory and executed by the processor to realize the invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution of the computer program in the image depth information monocular estimation device. For example, the computer program may be divided into an encoder module, a channel mixing module CSM and a decoder module with the specific functions described above.
The image depth information monocular estimation device can be a computing device such as a desktop computer, a notebook computer, a palm computer and a cloud server. The image depth information monocular estimation device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the above is merely an example of an image depth information monocular estimation device and does not constitute a limitation of the image depth information monocular estimation device, and may include more or fewer components, or combine certain components, or different components, e.g., the image depth information monocular estimation device may further include an input output device, a network access device, a bus, etc.
The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the image depth information monocular estimation device and connects the parts of the entire device using various interfaces and lines.
The memory may be used to store the computer program and/or the module, and the processor may implement various functions of the image depth information monocular estimation apparatus by running or executing the computer program and/or the module stored in the memory and calling the data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc.
In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the units integrated in the image depth information monocular estimation apparatus are implemented as software functional units and sold or used as an independent product, they may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the above method by instructing related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor implements the steps of the above method.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that the content contained in the computer readable medium may be added to or removed as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
Examples
As shown in fig. 1-3, the present embodiment provides a monocular estimation method for image depth information, which includes the following steps:
Step 1, constructing an image training set comprising stereo image pairs; the stereo image pairs are captured with a stereo camera whose relative pose between the two cameras is known.
The image to be estimated in a stereo image pair is recorded as view T, and the other image as view S; the same random downsampling and cropping are applied to view T and view S respectively to obtain the cropped view T and the cropped view S, and the intrinsic parameters of the stereo camera are adjusted accordingly. Preferably, the intrinsic parameters comprise the camera focal length and the principal point coordinates. The intrinsics are tied to the resolution of the stereo image pair: downsampling and cropping change the image resolution, so the intrinsics must be modified, otherwise errors arise in step 5 when the pixel correspondences between the two views are computed from the intrinsics. This procedure also serves as data augmentation, as sketched below.
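A minimal sketch of the intrinsics adjustment (assuming a pinhole model with focal lengths fx, fy and principal point cx, cy; the function and argument names are illustrative, not from the patent):

```python
def adjust_intrinsics(fx, fy, cx, cy, scale, crop_x0, crop_y0):
    # Downsampling by `scale` scales the focal lengths and the principal point;
    # cropping at pixel offset (crop_x0, crop_y0), applied after downsampling,
    # shifts the principal point.
    fx, fy = fx * scale, fy * scale
    cx, cy = cx * scale - crop_x0, cy * scale - crop_y0
    return fx, fy, cx, cy
```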
Step 2, taking the cropped view T as the input of the SGM algorithm and performing rough depth estimation on it to obtain a rough depth map corresponding to view T; filtering with the consistency constraint between the rough depth map corresponding to view T and view S, removing pixels with large errors, to obtain a filtered depth map; the filtered depth map is used as the training pseudo-label.
View T is taken as the input of the self-supervised channel mixing network, which outputs the depth map of view T; the depth map of view T is converted into the point cloud of view T, and for each pixel the corresponding pixel position in view S is obtained; the colors of the corresponding pixels in view S are warped back to view T by bilinear interpolation to obtain the generated image T' of view T; errors are computed between view T and the generated image T' and between the estimated depth and the training pseudo-labels, yielding the trained self-supervised channel mixing network.
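For rectified stereo pairs, the depth-to-point-cloud-to-correspondence chain collapses to a horizontal shift of disp = fx * baseline / depth pixels along the epipolar line, and the bilinear sampling can be done with grid_sample; a hedged sketch (rectified cameras assumed; the sign of the shift depends on whether S is the left or the right camera):

```python
import torch
import torch.nn.functional as F

def synthesize_T_from_S(img_S, depth_T, fx, baseline):
    # img_S: (B, 3, H, W) source view S; depth_T: (B, 1, H, W) predicted depth of view T.
    B, _, H, W = img_S.shape
    disp = fx * baseline / depth_T.clamp(min=1e-6)       # disparity in pixels
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = xs.to(img_S) - disp.squeeze(1)                  # shift along the epipolar line
    ys = ys.to(img_S).expand_as(xs)
    grid = torch.stack([2 * xs / (W - 1) - 1,            # normalize to [-1, 1] for grid_sample
                        2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(img_S, grid, mode="bilinear",   # bilinear interpolation of S's colors
                         padding_mode="border", align_corners=True)
```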
In this embodiment, the error function between view T and the generated image T' is the L1 function; the error function between the estimated depth and the training pseudo-label is the SSIM function.
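A hedged sketch of these two error terms (the SSIM below is the common 3x3 average-pooling variant; the validity mask and the equal weighting of the two terms are assumptions, not specified in the patent):

```python
import torch
import torch.nn.functional as F

def ssim_dissimilarity(x, y):
    # Standard SSIM dissimilarity with 3x3 average pooling, values in [0, 1].
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return torch.clamp((1 - ssim) / 2, 0, 1)

def training_loss(view_T, gen_T, depth, pseudo_label, valid_mask):
    photometric = (view_T - gen_T).abs().mean()                         # L1 between T and T'
    supervision = ssim_dissimilarity(depth * valid_mask,
                                     pseudo_label * valid_mask).mean()  # SSIM vs pseudo-label
    return photometric + supervision                                    # equal weighting assumed
```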
Step 3, taking the image to be estimated as the input of the pre-trained self-supervised channel mixing network, which comprises an encoder module, a channel mixing module CSM and a decoder module; the specific process is as follows:
Step 31, encoding the image to be estimated with the encoder module to obtain several semantic feature maps of different levels; the semantic feature maps differ both in semantic level and in resolution.
In this embodiment, the encoder module adopts a multi-scale feature encoder G_Enc, which is a ResNet-based encoder comprising a convolutional layer conv1 and encoder layers layer1 to layer4.
The input of the multi-scale feature encoder G_Enc is an RGB image; the outputs are the semantic feature maps R-Conv-1 to R-Conv-5, whose resolutions decrease in turn; in this embodiment, the resolutions of the semantic feature maps R-Conv-1 to R-Conv-5 are 1/2, 1/4, 1/8, 1/16 and 1/32 of the resolution of the image to be estimated, respectively.
The multi-scale feature encoder G_Enc encodes the input image into feature maps of different semantic levels and resolutions: the ResNet-based encoder extracts 5 semantic feature maps at different network depths, and each of the semantic feature maps R-Conv-1 to R-Conv-5 carries different semantic information, which matters for different aspects of depth estimation.
Step 32, mixing and redistributing the semantic feature maps of different levels along the channel direction with the channel mixing module CSM to obtain fused features at different resolutions; in this embodiment, the channel mixing module CSM comprises a first convolution layer, an upsampling layer, a merging layer, a channel mixing layer, a splitting layer, a downsampling layer and a second convolution layer.
The first convolution layer comprises five 1x1 convolution layers with parameters (256, C, 1) - 256 output channels, C input channels, kernel size 1 - where C matches the channel number of the corresponding level's features; the convolution operation yields Conv-1 to Conv-5. The upsampling layer comprises four nearest-neighbor upsampling layers that upsample Conv-2 to Conv-5 to the same resolution as Conv-1. The merging layer merges all upsampled features along the channel direction. The internal operations of the channel mixing layer are, in order, reshape, transpose, reshape. The splitting layer divides the mixed features evenly along the channels into five features with the same channel number. The downsampling layer comprises four nearest-neighbor downsampling layers that downsample the second- to fifth-level outputs of the splitting layer so that their resolutions match Conv-2 to Conv-5 respectively, giving the features C-Conv-1 to C-Conv-5. The second convolution layer comprises five 1x1 convolution layers with C output channels, where C is 256.
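The reshape-transpose-reshape sequence is the standard channel shuffle; a minimal sketch with one group per pyramid level:

```python
import torch

def channel_shuffle(x, groups=5):
    # x: (B, C, H, W) with C divisible by `groups` (here, 5 pyramid levels).
    B, C, H, W = x.shape
    x = x.view(B, groups, C // groups, H, W)  # reshape: split channels into groups
    x = x.transpose(1, 2).contiguous()        # transpose: interleave the groups
    return x.view(B, C, H, W)                 # reshape: flatten back to (B, C, H, W)
```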
The specific process is as follows:
Step 321, performing a 1x1 convolution operation on each of the semantic feature maps R-Conv-1 to R-Conv-5 to obtain the semantic feature maps Conv-1 to Conv-5; the numbers of channels of the semantic feature maps Conv-1 to Conv-5 are the same; in this embodiment, the number of channels of the semantic feature maps Conv-1 to Conv-5 is 256.
Step 322, upsampling the semantic feature maps Conv-2 to Conv-5 so that their resolutions after upsampling equal the resolution of the semantic feature map Conv-1; the upsampling operation uses nearest-neighbor interpolation.
Step 323, merging the semantic feature map Conv-1 with the upsampled semantic feature maps Conv-2 to Conv-5 along the channel direction and then performing a channel mixing operation to obtain mixed semantic features.
Step 324, splitting the mixed semantic features evenly along the channel direction to obtain five levels of semantic features with the same number of channels.
Step 325, keeping the resolution of the first-level semantic features unchanged and recording them as the semantic feature map C-Conv-1; downsampling the second- to fifth-level semantic features so that their resolutions after downsampling equal those of the semantic feature maps Conv-2 to Conv-5 respectively, obtaining the semantic feature maps C-Conv-2 to C-Conv-5; the downsampling operation uses nearest-neighbor interpolation.
Step 326, performing a 1x1 convolution operation on each of the semantic feature maps C-Conv-1 to C-Conv-5 to obtain the semantic feature maps S-Conv-1 to S-Conv-5.
Step 327, merging the semantic feature maps S-Conv-1 to S-Conv-5 with the semantic feature maps Conv-1 to Conv-5 in one-to-one correspondence along the channel direction to obtain the fused features at different resolutions; these fused features form the enhanced feature pyramid and serve as the input of the depth decoder.
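Putting steps 321 to 327 together, a hedged sketch of the CSM forward pass (1x1 convolutions and nearest-neighbor resampling as described above; the class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSM(nn.Module):
    # Sketch of the channel mixing module for a 5-level pyramid (C = 256 as in the embodiment).
    def __init__(self, in_channels, C=256):
        super().__init__()
        self.pre = nn.ModuleList(nn.Conv2d(c, C, 1) for c in in_channels)   # first 1x1 convs
        self.post = nn.ModuleList(nn.Conv2d(C, C, 1) for _ in in_channels)  # second 1x1 convs

    def forward(self, feats):                            # feats: [R-Conv-1, ..., R-Conv-5]
        convs = [p(f) for p, f in zip(self.pre, feats)]  # Conv-1..5, all with C channels
        size = convs[0].shape[-2:]
        up = [convs[0]] + [F.interpolate(c, size=size, mode="nearest") for c in convs[1:]]
        x = torch.cat(up, dim=1)                         # merge along the channel direction
        B, tot, H, W = x.shape
        x = x.view(B, 5, tot // 5, H, W).transpose(1, 2).reshape(B, tot, H, W)  # channel shuffle
        out = []
        for i, (s, c, q) in enumerate(zip(torch.chunk(x, 5, dim=1), convs, self.post)):
            if i > 0:                                    # downsample back to level i's resolution
                s = F.interpolate(s, size=c.shape[-2:], mode="nearest")
            out.append(torch.cat([c, q(s)], dim=1))      # merge S-Conv-i with Conv-i
        return out                                       # enhanced feature pyramid
```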
In this embodiment, the semantic feature maps of different levels are mixed and redistributed along the channel direction by the channel mixing module CSM to obtain fused features at different resolutions. To exploit the advantages of features of different levels more effectively, this embodiment proposes the channel mixing module CSM, which performs a channel mixing operation between features of different levels to obtain mixed, multi-level enhanced pyramid features, so that features of different levels communicate fully along the channel dimension and complement each other's advantages.
Step 33, taking the fused features at different resolutions as the input of the decoder module; the decoder module adopts a deep neural network decoder, which decodes the fused features at each resolution in turn to obtain depth estimates of the corresponding resolutions, i.e. the depth image of the image to be estimated.
In this embodiment, the deep neural network decoder comprises a convolution block ConvBlock, an upsampling layer, a merging layer, a convolution layer and an output layer; the input of the convolution block is the fused features at different resolutions; the output of the convolution block is connected to the input of the upsampling layer, the output of the upsampling layer to the input of the merging layer, the output of the merging layer to the input of the convolution layer, and the output of the convolution layer to the input of the output layer.
Specifically, the input of the convolution block ConvBlock is the feature of the previous, adjacent resolution, and its output is a feature of the same resolution; the upsampling layer is a 2x upsampling layer that upsamples this feature by a factor of 2 to the current resolution; the input of the merging layer contains the 2x-upsampled feature at the current resolution and the encoder output feature at the current resolution, and its output is the features merged along the channel direction.
The input of the convolution layer is the features merged along the channel direction, and its output is a feature with 1 channel; the input of the output layer is this 1-channel feature, and the value output for each pixel represents its depth distribution probability, which lies between 0 and 1.
In this embodiment, the convolution block ConvBlock comprises a convolution layer and an activation function; the previous low-resolution feature, after upsampling, is added to the current high-resolution feature as a skip-layer feature, which further strengthens the depth estimation at the current resolution.
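A hedged sketch of one decoder stage (ConvBlock as a 3x3 convolution followed by ELU is an assumption; the patent states only that the convolution block comprises a convolution layer and an activation function):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    # Convolution block: convolution layer + activation function.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)

    def forward(self, x):
        return F.elu(self.conv(x))

def decoder_stage(prev_feat, enc_feat, block, depth_conv):
    # prev_feat: feature from the previous (lower) resolution; enc_feat: fused
    # encoder feature at the current resolution (output of the CSM).
    x = block(prev_feat)                                  # convolution block
    x = F.interpolate(x, scale_factor=2, mode="nearest")  # 2x upsampling layer
    x = torch.cat([x, enc_feat], dim=1)                   # merging layer (channel direction)
    y = torch.sigmoid(depth_conv(x))                      # 1-channel conv layer + sigmoid output
    return x, y   # x feeds the next stage; y is the per-pixel value in (0, 1)
```

Here depth_conv would be a convolution producing a single output channel, matching the 1-channel feature described above.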
The output layer adopts a sigmoid function; the output value of the sigmoid function is nonlinearly transformed to obtain the depth estimate at the corresponding resolution; the nonlinear transformation formula is as follows:
d=1/(a*y+b)
where d is the depth estimate at the corresponding resolution; a is a linear transformation coefficient calculated from the maximum depth; y is the output value of the sigmoid function; and b is a linear transformation coefficient calculated from the minimum depth.
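One convention that realizes this transformation maps y = 0 to the minimum depth and y = 1 to the maximum depth; a hedged sketch (the patent states only that a and b are computed from the maximum and minimum depths, so this particular derivation is an assumption, and the depth range values are illustrative):

```python
def sigmoid_to_depth(y, min_depth=0.1, max_depth=100.0):
    # b follows from d = min_depth at y = 0; a then follows from d = max_depth at y = 1.
    b = 1.0 / min_depth
    a = 1.0 / max_depth - b
    return 1.0 / (a * y + b)   # d = 1 / (a*y + b), monotonically increasing in y
```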
In this embodiment, the deep neural network decoder estimates a corresponding depth map on the features of each resolution through combinations of convolutions and activation functions, making full use of multi-scale supervised learning; at the same time, the low-resolution features are passed through skip layers and merged with the high-resolution features to jointly estimate the high-resolution depth map.
As shown in fig. 4 to 6, fig. 4 shows the original image whose depth is to be estimated in the embodiment, fig. 5 shows the depth image estimated from it by the existing baseline method, and fig. 6 shows the depth image estimated by the method of the embodiment. As can be seen from figs. 4 to 6, with the estimation method of the embodiment the contours of objects are clearer in the estimated depth map; in particular, the depth estimation of slender objects is markedly better.
With the image depth information monocular estimation method, the input image is encoded by the multi-scale feature encoder G_Enc into feature maps of different semantic levels and resolutions; the channel mixing module CSM mixes and redistributes all the features along the channel direction to obtain fused features; the decoder decodes the corresponding depth map from the fused features of each resolution in turn. Applied to the field of self-supervised monocular depth estimation, the method obtains a better estimation effect.
The description of the relevant parts in the image depth information monocular estimation device and the computer readable storage medium provided in this embodiment may refer to the detailed description of the corresponding parts in the image depth information monocular estimation method described in this embodiment, which is not repeated here.
The image depth information monocular estimation method makes full use of the advantages of the different semantic levels of features obtained by the encoder, helping the decoder achieve better estimation accuracy. The shallow layers of the encoder extract higher-resolution features with smaller receptive fields, which improves the accuracy of per-pixel depth estimation in simple cases; the deep layers extract lower-resolution features with larger receptive fields, which benefits depth reasoning for pixels in difficult cases such as weak texture. The channel mixing module mixes the encoder features along the channel direction, so that features of different semantic levels communicate along the channels, and then splits them evenly to obtain new fused features. The fused features are merged with the original encoder features as the input features of the next-stage decoder. The decoder estimates a corresponding depth map from the features at each resolution, and the low-resolution features of the previous level are passed through skip layers and merged with the high-resolution features of the next level to jointly estimate the high-resolution depth map.
The above embodiment is only one implementation of the technical solution of the present invention; the scope of the claimed invention is not limited to this embodiment, but also covers any changes, substitutions and other implementations readily conceived by those skilled in the art within the technical scope of the invention.

Claims (7)

1. A method for monocular estimation of image depth information, comprising the following steps:
taking an image to be estimated as the input of a pre-trained self-supervised channel mixing network; the self-supervised channel mixing network comprises an encoder module, a channel mixing module CSM and a decoder module;
encoding the image to be estimated with the encoder module to obtain several semantic feature maps of different levels; the semantic feature maps differ both in semantic level and in resolution;
mixing and redistributing the semantic feature maps of different levels along the channel direction with the channel mixing module CSM to obtain fused features at different resolutions;
decoding the fused features at each resolution with the decoder module to obtain depth estimates of the corresponding resolutions, yielding the depth image of the image to be estimated;
the training process of the pre-trained self-supervised channel mixing network is specifically as follows:
constructing an image training set comprising stereo image pairs, wherein the image to be estimated in a stereo image pair is recorded as view T and the other image as view S; applying the same random downsampling and cropping to view T and view S respectively to obtain the cropped view T and the cropped view S;
performing rough depth estimation on the cropped view T to obtain a rough depth map corresponding to view T; filtering with the consistency constraint between the rough depth map corresponding to view T and view S to obtain a filtered depth map; using the filtered depth map as a pseudo-label for training;
taking view T as the input of the self-supervised channel mixing network and outputting the depth map of view T;
converting the depth map of view T into the point cloud of view T, and obtaining for each pixel in the point cloud of view T the corresponding pixel position in view S;
warping the colors of the corresponding pixels in view S back to view T by bilinear interpolation to obtain the generated image T' of view T;
computing errors between view T and the generated image T', and between the estimated depth and the training pseudo-labels, to obtain the trained self-supervised channel mixing network;
wherein the encoder module adopts a multi-scale feature encoder G Enc, the multi-scale feature encoder G Enc being a ResNet-based encoder comprising a convolutional layer conv1, a first encoder layer layer1, a second encoder layer layer2, a third encoder layer layer3 and a fourth encoder layer layer4;
the input of the multi-scale feature encoder G Enc is an RGB image, and its outputs are a semantic feature map R-Conv-1, a semantic feature map R-Conv-2, a semantic feature map R-Conv-3, a semantic feature map R-Conv-4 and a semantic feature map R-Conv-5;
the resolutions of the semantic feature maps R-Conv-1, R-Conv-2, R-Conv-3, R-Conv-4 and R-Conv-5 decrease in sequence;
the process of mixing and redistributing the plurality of semantic feature maps of different levels along the channel direction with the channel mixing module CSM to obtain fused features at different resolutions is specifically as follows:
performing convolution operations on the semantic feature maps R-Conv-1 to R-Conv-5 to obtain a semantic feature map Conv-1, a semantic feature map Conv-2, a semantic feature map Conv-3, a semantic feature map Conv-4 and a semantic feature map Conv-5, wherein the semantic feature maps Conv-1 to Conv-5 have the same number of channels;
upsampling the semantic feature maps Conv-2, Conv-3, Conv-4 and Conv-5 respectively, so that the resolutions of the upsampled semantic feature maps Conv-2 to Conv-5 are the same as the resolution of the semantic feature map Conv-1;
merging the semantic feature map Conv-1 with the upsampled semantic feature maps Conv-2 to Conv-5 and then performing a channel mixing operation to obtain mixed semantic features;
splitting the mixed semantic features evenly along the channel direction to obtain five levels of semantic features with the same number of channels;
keeping the resolution of the first-level semantic features unchanged and denoting them as semantic feature map C-Conv-1;
downsampling the second- to fifth-level semantic features respectively, so that their resolutions after downsampling correspond to the resolutions of the semantic feature maps Conv-2, Conv-3, Conv-4 and Conv-5, obtaining a semantic feature map C-Conv-2, a semantic feature map C-Conv-3, a semantic feature map C-Conv-4 and a semantic feature map C-Conv-5;
performing convolution operations on the semantic feature maps C-Conv-1 to C-Conv-5 to obtain a semantic feature map S-Conv-1, a semantic feature map S-Conv-2, a semantic feature map S-Conv-3, a semantic feature map S-Conv-4 and a semantic feature map S-Conv-5;
and correspondingly merging the semantic feature maps Conv-1 to Conv-5 with the semantic feature maps S-Conv-1 to S-Conv-5 to obtain the fused features at different resolutions.
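The CSM steps of claim 1 can be illustrated with a hedged PyTorch sketch: 1x1 convolutions equalize channel counts, nearest-neighbor resampling aligns resolutions, a ShuffleNet-style shuffle mixes channels, and an even split followed by downsampling and convolution yields S-Conv-1 to S-Conv-5, which are merged with Conv-1 to Conv-5. The channel width mid, the kernel sizes and the shuffle implementation are assumptions; only the order of operations comes from the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    # ShuffleNet-style shuffle: interleave channels across the groups.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class CSM(nn.Module):
    """Illustrative channel mixing module; channel sizes are assumptions."""
    def __init__(self, in_chs=(64, 64, 128, 256, 512), mid=64):
        super().__init__()
        # 1x1 convolutions produce Conv-1 ... Conv-5 with equal channel counts.
        self.pre = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in in_chs])
        # Convolutions producing S-Conv-1 ... S-Conv-5 after the split.
        self.post = nn.ModuleList([nn.Conv2d(mid, mid, 3, padding=1) for _ in in_chs])

    def forward(self, feats):
        convs = [p(f) for p, f in zip(self.pre, feats)]           # Conv-1..5
        size = convs[0].shape[-2:]
        up = [convs[0]] + [F.interpolate(c, size=size, mode="nearest")
                           for c in convs[1:]]                    # upsample to Conv-1 size
        mixed = channel_shuffle(torch.cat(up, dim=1), groups=len(up))
        splits = torch.chunk(mixed, len(up), dim=1)               # even channel split
        out = []
        for i, (s, c) in enumerate(zip(splits, convs)):
            if i > 0:                                             # back to original size
                s = F.interpolate(s, size=c.shape[-2:], mode="nearest")
            out.append(torch.cat([c, self.post[i](s)], dim=1))    # merge Conv-i, S-Conv-i
        return out  # fused features at five resolutions

# Paired with an encoder producing (64, 64, 128, 256, 512)-channel maps,
# CSM()(feats) returns five fused maps of 2*mid channels each.
```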
2. The method for monocular estimation of image depth information according to claim 1, wherein the merging operation uses a Concat function; the splitting operation uses a Split function; the upsampling operation uses nearest-neighbor upsampling, and the downsampling operation uses nearest-neighbor downsampling.
3. The method for monocular estimation of image depth information according to claim 1, wherein the decoder module adopts a deep neural network decoder comprising a convolution block, an upsampling layer, a merging layer, a convolutional layer and an output layer;
the inputs of the convolution block are the fused features at different resolutions; the output of the convolution block is connected to the input of the upsampling layer, the output of the upsampling layer is connected to the input of the merging layer, the output of the merging layer is connected to the input of the convolutional layer, and the output of the convolutional layer is connected to the input of the output layer.
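As an illustration of claim 3's connection order (convolution block, then upsampling layer, then merging layer, then convolutional layer, then output layer), a minimal sketch of one decoder stage follows; the channel sizes and ELU activations are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One illustrative decoder stage in claim 3's layer order."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv_block = nn.Sequential(               # convolution block
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ELU())
        self.conv = nn.Sequential(                     # convolutional layer
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ELU())
        self.output = nn.Sequential(                   # output layer (see claim 4)
            nn.Conv2d(out_ch, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x, skip):
        x = self.conv_block(x)
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # upsampling layer
        x = torch.cat([x, skip], dim=1)                       # merging layer
        x = self.conv(x)
        return x, self.output(x)  # features for the next stage and sigmoid value y
```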
4. The method for monocular estimation of image depth information according to claim 3, wherein the output layer uses a sigmoid function, and a nonlinear transformation is applied to the output value of the sigmoid function to obtain the depth estimate at the corresponding resolution;
wherein the nonlinear transformation is:
d = 1/(a*y + b)
where d is the depth estimate at the corresponding resolution; y is the output value of the sigmoid function; a is a linear transformation coefficient calculated from the maximum depth; and b is a linear transformation coefficient calculated from the minimum depth.
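For concreteness, one boundary assignment consistent with claim 4's attribution (a computed from the maximum depth, b from the minimum depth) maps y = 0 to the minimum depth and y = 1 to the maximum depth. The sketch below, including the 0.1-100 depth range, is an illustrative assumption rather than the patent's stated values.

```python
def y_to_depth(y, min_depth=0.1, max_depth=100.0):
    """Map the sigmoid output y in [0, 1] to depth via d = 1/(a*y + b).

    Boundary choice (an assumption): y = 0 -> d = min_depth gives b = 1/min_depth;
    y = 1 -> d = max_depth gives a = 1/max_depth - b, i.e. a is computed from
    the maximum depth and b from the minimum depth, as in claim 4.
    """
    b = 1.0 / min_depth
    a = 1.0 / max_depth - b
    return 1.0 / (a * y + b)

print(y_to_depth(0.0), y_to_depth(1.0))  # 0.1 and 100.0, up to float rounding
```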
5. The method for monocular estimation of image depth information according to claim 1, wherein the error function used between view T and its generated image T' is an L1 function, and the error function used between the estimated depth and the training pseudo label is an SSIM function.
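A hedged sketch of claim 5's two error terms follows: an L1 photometric error between view T and the generated image T', and an SSIM-based error between the estimated depth and the pseudo label. The 3x3 average-pooling SSIM is a common formulation in self-supervised depth work, and the weight w is an assumption; the patent specifies neither.

```python
import torch
import torch.nn.functional as F

def ssim_distance(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """SSIM-based distance computed with 3x3 average pooling (an assumption;
    not necessarily the patent's exact SSIM variant)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * sig_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (sig_x + sig_y + C2))
    return torch.clamp((1 - ssim) / 2, 0, 1).mean()

def training_error(view_t, generated_t, depth, pseudo_label, w=1.0):
    photometric = F.l1_loss(view_t, generated_t)      # L1 between T and T'
    depth_error = ssim_distance(depth, pseudo_label)  # SSIM vs. the pseudo label
    return photometric + w * depth_error              # weight w is an assumption
```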
6. An image depth information monocular estimation apparatus comprising a memory, a processor, and executable instructions stored in the memory and executable by the processor, wherein the processor, when executing the executable instructions, implements the method of any one of claims 1-5.
7. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1-5.
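As a final illustration, the view-synthesis step used in training (claim 1: depth map, then point cloud of view T, then corresponding pixel positions in view S, then bilinear sampling back to view T) can be sketched as below. The pinhole camera model, the argument names (K, K_inv, T_ts) and grid_sample-based bilinear sampling are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def synthesize_view_t(depth_t, img_s, K, K_inv, T_ts):
    """Back-project view T's depth to a point cloud, project it into view S,
    and bilinearly sample S's colors to form the generated image T'.

    depth_t: (N,1,H,W) depth of view T; img_s: (N,3,H,W) view S;
    K, K_inv: (N,3,3) intrinsics and inverse; T_ts: (N,4,4) pose from T to S.
    """
    n, _, h, w = depth_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().reshape(1, 3, -1)
    cam = depth_t.reshape(n, 1, -1) * (K_inv @ pix)         # point cloud of view T
    cam = torch.cat([cam, torch.ones(n, 1, h * w)], dim=1)  # homogeneous coordinates
    proj = K @ (T_ts @ cam)[:, :3]                          # pixel positions in view S
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    uv = uv.reshape(n, 2, h, w).permute(0, 2, 3, 1)
    grid = torch.stack([2 * uv[..., 0] / (w - 1) - 1,       # normalize for grid_sample
                        2 * uv[..., 1] / (h - 1) - 1], dim=-1)
    return F.grid_sample(img_s, grid, mode="bilinear", align_corners=True)
```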
CN202110554113.6A 2021-05-20 2021-05-20 Image depth information monocular estimation method, apparatus and readable storage medium Active CN113192149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110554113.6A CN113192149B (en) 2021-05-20 2021-05-20 Image depth information monocular estimation method, apparatus and readable storage medium

Publications (2)

Publication Number Publication Date
CN113192149A (en) 2021-07-30
CN113192149B (en) 2024-05-10

Family

ID=76982747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110554113.6A Active CN113192149B (en) 2021-05-20 2021-05-20 Image depth information monocular estimation method, apparatus and readable storage medium

Country Status (1)

Country Link
CN (1) CN113192149B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503461B (en) * 2023-06-28 2023-10-31 中国科学院空天信息创新研究院 Monocular image depth estimation method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445476A (en) * 2020-02-27 2020-07-24 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN112561979A (en) * 2020-12-25 2021-03-26 天津大学 Self-supervision monocular depth estimation method based on deep learning
CN112819853A (en) * 2021-02-01 2021-05-18 太原理工大学 Semantic prior-based visual odometer method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sun Yunhan; Shi Jinlong; Sun Zhengxing. Estimating single-image depth information using self-supervised convolutional networks. Journal of Computer-Aided Design & Computer Graphics, (04), full text. *

Also Published As

Publication number Publication date
CN113192149A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
Alhashim et al. High quality monocular depth estimation via transfer learning
US10614574B2 (en) Generating image segmentation data using a multi-branch neural network
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
CN112651979B (en) Lung X-ray image segmentation method, system, computer equipment and storage medium
CN111476719B (en) Image processing method, device, computer equipment and storage medium
CN113379601B (en) Real world image super-resolution method and system based on degradation variable-component self-encoder
Wu et al. Spatial-angular attention network for light field reconstruction
CN115457021A (en) Skin disease image segmentation method and system based on joint attention convolution neural network
Maslov et al. Online supervised attention-based recurrent depth estimation from monocular video
CN113192149B (en) Image depth information monocular estimation method, apparatus and readable storage medium
CN114972378A (en) Brain tumor MRI image segmentation method based on mask attention mechanism
CN112699885A (en) Semantic segmentation training data augmentation method and system based on antagonism generation network GAN
CN115908789A (en) Cross-modal feature fusion and asymptotic decoding saliency target detection method and device
CN111340935A (en) Point cloud data processing method, intelligent driving method, related device and electronic equipment
CN114694074A (en) Method, device and storage medium for generating video by using image
CN112489103B (en) High-resolution depth map acquisition method and system
CN114462486A (en) Training method of image processing model, image processing method and related device
Ketsoi et al. SREFBN: Enhanced feature block network for single‐image super‐resolution
Yang et al. Deep convolutional grid warping network for joint depth map upsampling
CN116486009A (en) Monocular three-dimensional human body reconstruction method and device and electronic equipment
CN115661276A (en) Image data encoding method, device, apparatus, medium, and program
CN113689435A (en) Image segmentation method and device, electronic equipment and storage medium
Nie et al. Binocular image dehazing via a plain network without disparity estimation
CN115205361A (en) Depth image completion method, device, equipment and storage medium
Tran et al. Encoder–decoder network with guided transmission map: Robustness and applicability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant