CN113034563A - Self-supervised monocular depth estimation method based on feature sharing


Info

Publication number
CN113034563A
CN113034563A (application CN202110196301.6A)
Authority
CN
China
Prior art keywords
feature
monocular
depth estimation
depth
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110196301.6A
Other languages
Chinese (zh)
Inventor
杨明
李雪
范圣印
陈禹行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yihang Yuanzhi Intelligent Technology Co Ltd
Original Assignee
Suzhou Yihang Yuanzhi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Yihang Yuanzhi Intelligent Technology Co Ltd filed Critical Suzhou Yihang Yuanzhi Intelligent Technology Co Ltd
Priority to CN202110196301.6A
Publication of CN113034563A
Legal status: Pending

Classifications

    • G06T7/50 Image analysis; Depth or shape recovery
    • G06F18/25 Pattern recognition; Fusion techniques
    • G06N3/045 Neural networks; Combinations of networks
    • G06N3/08 Neural networks; Learning methods
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20221 Image fusion; Image merging
    • G06T2207/30244 Camera pose

Abstract

A feature-sharing-based self-supervised monocular depth estimation method adopts a new single-network structure that integrates the pose estimation module into the depth estimation module, unifying the two functions of depth estimation and pose estimation and yielding a feature-sharing-based monocular depth estimation network. The network comprises: a feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised-training unit. The pose estimation unit produces real-time pose output for streaming video based on feature matching, improving the accuracy of pose estimation; the depth estimation unit is built on an efficient encoder-decoder module, improving computational efficiency; the output of the depth estimation unit and the output of the pose estimation unit are combined, and supervision information is extracted from the original pictures, completing the self-supervised network training process. The method effectively solves the problem of acquiring high-precision monocular depth information in real time for unmanned driving.

Description

Self-supervised monocular depth estimation method based on feature sharing
Technical Field
The disclosure relates to the technical field of depth perception and computer vision in the unmanned-driving industry, in particular to a feature-sharing-based self-supervised monocular depth estimation method for unmanned-driving scenes, and more particularly to such a method implemented with a single network structure.
Background
With the continuous development of computer vision technology, three-dimensional scene perception plays a crucial role in the unmanned-driving industry. Unlike two-dimensional perception, three-dimensional perception detects pedestrians, vehicles, obstacles and other information around the unmanned vehicle in real three-dimensional space. The information acquired by three-dimensional perception is a key basis for the vehicle's motion decisions, and depth information is the foundation of the three-dimensional scene perception task. Given the constraints of the actual sensor and camera layout on unmanned vehicles and the need to accurately detect pedestrians, vehicles, obstacles and other information around the vehicle in dynamically changing situations, the three-dimensional perception system must be able to recover scene depth from a single camera, whether that camera is the only one in a monocular acquisition system or any one camera in a multi-camera acquisition system; that is, it must solve the problem of acquiring depth information of the scene from one monocular camera alone. Since such situations arise routinely during the dynamic operation of an unmanned vehicle, acquiring depth information with a monocular camera is a key research topic in the three-dimensional scene perception task.
At present, for the monocular depth estimation task, existing research can be divided into supervised and unsupervised directions. Supervised monocular depth estimation networks generally perform well on specific datasets, but suffer from large numbers of model parameters and the difficulty of obtaining labeled data; unsupervised monocular depth estimation networks can be applied flexibly to different datasets, but suffer from limited model accuracy and poorly designed training strategies.
To assess the state of the art, existing patents and papers were searched, compared and analyzed:
Patent document CN 108961327 A, "Monocular depth estimation method and apparatus, device, and storage medium", proposes training a monocular depth estimation network with binocular information, obtaining the network by training with synthetic and real binocular sample data, respectively. The method uses a small amount of real binocular data and a large amount of synthetic binocular data, so its effect depends heavily on the quality of the synthetic data; moreover, it requires real binocular data to fine-tune the network, so training and learning can only be performed offline, which increases the cost of data acquisition.
Patent document CN 111680554 A, "Depth estimation method and apparatus for automatic driving scene, and autonomous vehicle", proposes a cascade network to optimize and complement the result of monocular depth estimation: a depth estimation model first generates basic depth estimation information, and a deviation estimation network then outputs deviation estimates for the target region, addressing insufficient depth estimation accuracy in that region. To improve the accuracy of monocular depth estimation on the target region, the method introduces an object detection method and a cascaded network, which enlarges the overall network, complicates the structure, raises system cost and computational consumption, and makes it difficult for the neural network method to run in real time.
Patent document CN 110599533 A, "Fast monocular depth estimation method applicable to embedded platform", proposes a lightweight monocular depth estimation method in which a lightweight depth estimation network is deployed on an embedded platform and a model training framework is configured on an edge server, the two interacting over the network: the embedded platform provides data and labels to the edge server, and the edge server trains the model and updates the one deployed on the embedded platform. The method addresses deployment of a depth estimation network on an embedded platform, but it uses an RGB-D camera to collect monocular pictures and depth maps; limited by the RGB-D camera, its depth-sensing range is restricted to indoor scenes, it is generally applied to indoor robot motion, and it is not suitable for outdoor vehicle-side applications.
It can be seen that, for unmanned driving, existing monocular depth estimation methods, whether in network model accuracy or training strategy, cannot meet the requirements for detection precision, stability and real-time performance, and it is difficult to obtain a satisfactory overall result on a low-power on-board processor. A new monocular depth estimation method is therefore needed that preserves monocular depth estimation accuracy, meets the needs of outdoor unmanned vehicles, adds no extra computational overhead, can run on a low-power on-board processor, and requires no complex, high-cost sensor system.
Disclosure of Invention
To suit unmanned-driving applications and perform real-time, efficient, accurate and reliable depth estimation on video images acquired by a monocular camera, so as to effectively obtain depth information for three-dimensional scene perception, the disclosure provides a new self-supervised monocular depth estimation method. The method adopts a single network structure: instead of two independent networks for depth estimation and pose estimation, feature sharing is used to integrate the pose estimation unit into the depth estimation unit, so that depth estimation and pose estimation are fused within a single network, yielding a brand-new feature-sharing-based monocular single-source depth estimation network. This simplifies the network structure, speeds up network processing, and determines in real time the depth of objects detected in the monocular camera video stream.
The monocular single-source depth estimation network comprises a feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised-training unit.
The pose estimation unit produces real-time pose output for streaming video based on feature matching, improving the accuracy of pose estimation.
The depth estimation unit is built on an efficient encoder-decoder module, in which a hybrid-convolution encoder mixes depthwise separable convolution, an SE module and a residual convolution module, combined with dilated convolution, improving computational efficiency and achieving high-precision depth output.
The output of the depth estimation unit and the output of the pose estimation unit are combined, supervision information is derived from the original pictures, and the self-supervised network training process is realized using the current-frame features and the image frames.
A feature-sharing-based self-supervised monocular depth estimation method is thereby obtained.
The method reduces network computation and video-memory usage, improves the output accuracy of pose estimation, reduces the network's demands on video memory and computing resources, and improves computational efficiency; efficient feature encoding and decoding modules are designed so that the output accuracy of the network is improved while computation and parameter counts are reduced, ensuring the real-time performance of the depth estimation task.
Specifically, to solve the above technical problem, according to one aspect of the present invention, there is provided a feature-sharing-based self-supervised monocular depth estimation method, wherein:
a single network structure is adopted, the pose estimation unit is integrated into the depth estimation unit, the operations of depth estimation and pose estimation are fused, and a feature-sharing-based monocular single-source depth estimation network is obtained; the monocular single-source depth estimation network comprises: a shared feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised-training unit;
the method comprises the following steps:
step one, data acquisition: acquiring data from the video stream of a monocular camera and outputting image frames;
step two, shared feature encoding: after an image frame is received, it is preprocessed and a multi-scale shared feature group is output through the encoder;
step three, decoding: receiving the multi-scale shared feature group, processing the multi-scale features through the depth estimation unit, and outputting a depth map at the original resolution; performing feature matching and decoding on the current-frame features and the previous-frame features through the pose estimation unit, and outputting the pose transformation between the two frames;
step four, loss calculation: combining the depth map output by the depth estimation unit with the pose transformation between the two frames output by the pose estimation unit, reconstructing the target frame, and supervising the training of the network through the difference between the original target frame and the reconstructed target frame; a sketch of one training iteration is given after these steps.
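A minimal sketch of one such training iteration, assuming a PyTorch-style implementation, is given below. The module and helper names (encoder, depth_decoder, pose_decoder, warp_to_target, loss_fn, cache) are illustrative assumptions, not the patent's reference implementation.

```python
import torch

def training_step(frame_t, encoder, depth_decoder, pose_decoder,
                  warp_to_target, loss_fn, cache, optimizer, K):
    """One self-supervised iteration: shared encoding, depth and pose decoding,
    reconstruction-based loss. All names here are illustrative assumptions."""
    feats_t = encoder(frame_t)            # step two: multi-scale shared feature group
    depth_t = depth_decoder(feats_t)      # step three (a): depth map at original resolution

    if cache.get("feats") is not None:
        # step three (b): pose from matching current and previous frame features
        T_t_to_s = pose_decoder(feats_t[-1], cache["feats"][-1])

        # step four: reconstruct the target frame and supervise with the difference
        frame_rec = warp_to_target(cache["frame"], depth_t, T_t_to_s, K)
        loss = loss_fn(frame_t, frame_rec, depth_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # step five: keep this frame's features and image for the next iteration
    cache["feats"] = [f.detach() for f in feats_t]
    cache["frame"] = frame_t.detach()
```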
Preferably, the monocular camera is deployed on an unmanned vehicle.
Preferably, the monocular camera is disposed at an upper edge of a front windshield of the unmanned vehicle.
Preferably, the target frame is reconstructed by projection and interpolation operations.
Preferably, the method further comprises the following steps:
and step five, storage: storing the original image of the current frame and the features output in the feature encoding step in a storage medium, for use by the decoding step and the loss calculation step at the next time instant.
Preferably, the monocular camera resolution is 720P or greater;
the monocular camera is a monocular camera in a monocular acquisition system, or any one of monocular cameras in a multi-view acquisition system comprising a plurality of monocular cameras.
Preferably, in step one, image frames are generated by sampling the video stream produced by the monocular camera in real time at a certain frequency.
Preferably, in the shared feature encoding, a hybrid-convolution encoder is employed that mixes depthwise separable convolution and an SE module (i.e., the Squeeze-and-Excitation module) with a residual convolution module, combined with dilated convolution.
Preferably, the image is processed by a deep neural network, and the features at each downsampling stage are stored to generate a multi-scale feature set.
Preferably, the deep neural network comprises a deep residual network (ResNet) series network.
Preferably, in the hybrid convolution, a 1×1 dimension-raising convolution is used to expand the feature dimension space, and a 3×3 channel-wise convolution combined with a 1×1 point-wise convolution forms a depthwise separable convolution, improving the computational efficiency of the features and reducing the number of model parameters; the SE module is formed by two fully connected layers and enhances feature expressiveness by re-evaluating the importance of each feature channel; a dilated convolution is introduced in the channel-wise convolution to preserve the receptive field during feature extraction.
Preferably, the depthwise separable convolution decomposes a standard convolution into two steps:
first, a channel-wise convolution is performed, in which each convolution kernel is responsible for only one channel;
second, a point-wise convolution is performed, in which the convolution kernel is reduced to 1×1 and spans all input feature channels, mixing the channel information of the features to obtain enhanced features;
the computation Cal(DW) and the parameter count Parm(DW) of the depthwise separable convolution are respectively:
Cal(DW) = K_c × K_c × C_in × W_out × H_out + C_in × C_out × W_out × H_out   (3)
Parm(DW) = K_c × K_c × C_in + C_in × C_out   (4)
where C_in is the number of channels of the input feature map, C_out is the number of channels of the output feature map, K_c is the size of the convolution kernel, and W_out and H_out are the width and height of the output feature map, respectively.
Preferably, the SE module is used to learn the correlations among the channels of the feature map, evaluate and score each channel, and then selectively screen the feature channels.
Preferably, the operation of the SE module comprises:
first, compressing the feature map: global average pooling is applied to the C×W×H feature map to obtain a 1×1×C feature map with a global receptive field;
second, feature excitation: two fully connected operations perform a global information interaction across the channel dimension of the 1×1×C feature map, a Sigmoid activation function computes a score for each channel, and the scores are finally multiplied with the original features to obtain the channel-weighted feature map.
Preferably, dilated convolution is introduced into the fifth layer (layer5) and the sixth layer (layer6) of the network, and a low dilation rate is selected so that the backbone network extracts features with a large receptive field and high resolution;
the max-pooling layer is removed, two layers of ordinary convolution, the seventh layer (layer7) and the eighth layer (layer8), are added after the sixth layer (layer6) with dilation rates of 2 and 1, respectively, and their skip connections are removed to obtain a smoother network output.
Preferably, for the multi-scale feature output, the features output by the second layer (layer2), the third layer (layer3), the fourth layer (layer4), the fifth layer (layer5) and the eighth layer (layer8) are selected to form a multi-scale feature map set representing features at different scales, which is used for feature decoding in the decoding step.
Preferably, the depth estimation unit performs feature fusion using a fusion module based on a spatial attention mechanism. Let f_E be the encoder-side feature and f_D the decoder-side feature; each of f_E and f_D is passed through a 1×1 convolution for dimensionality reduction, giving the compressed features f_h_E and f_h_D, respectively. The feature f_h_D and the feature f_h_E are concatenated, and after activation the features are compressed to one channel by a 3×3 convolution followed by a Sigmoid function, which outputs a weight map σ representing the screening of encoder-side information conditioned on the decoder-side information; the weight map σ is then multiplied point-wise with the original encoder feature f_E, and the result is finally concatenated with the decoder feature f_D for deep decoding.
Preferably, the pose estimation unit performs feature matching by correlation calculation, which receives the feature maps f_1 and f_2 of two frames separately and, for any pair of positions x_1 in f_1 and x_2 in f_2, performs the correlation calculation over feature blocks of size (2k+1)×(2k+1) centered on them; for a given feature block in f_1, the similarity is not computed against all feature blocks in f_2, but only against the feature blocks of f_2 within a displacement range of d in the up, down, left and right directions around the corresponding position. The calculation is shown in equation (5):
c(x_1, x_2) = Σ_{o ∈ [-k,k]×[-k,k]} ⟨ f_1(x_1 + o), f_2(x_2 + o) ⟩   (5)
where f_1(·) and f_2(·) denote the input feature maps, x_1 and x_2 denote the computation centers, k denotes the computation range, ⟨·⟩ denotes the dot-product operation, o denotes the displacement step within the local region, and c(x_1, x_2) denotes the final value of the dot-product operation over the feature-map patches centered on x_1 and x_2; the calculation direction of equation (5) is unidirectional and does not satisfy the commutative law, i.e., c(x_1, x_2) ≠ c(x_2, x_1), which guarantees the unidirectionality of the pose transformation in the pose estimation process.
Preferably, dense convolution is adopted as the decoder to decode the correlation between features, and the pose transformation matrix between the two frames is finally output through one layer of dense convolution and three convolution layers.
Preferably, the loss calculation includes:
reconstruction of the target frame: the correspondence between coordinates in the source frame and the target frame is calculated using the decoded depth map and pose transformation matrix, and the target frame is reconstructed accordingly;
calculation of the loss function: the L1 loss, the structural similarity loss and the edge smoothing loss are used as loss functions.
Preferably, the source frame reconstructs the target frame using the depth map output by the depth estimation unit and the pose transformation between the source frame and the target frame output by the pose estimation unit: points in two-dimensional space are projected into three-dimensional space by a back-projection operation, the three-dimensional points are projected into the coordinate space of the adjacent frame using a coordinate-system transformation and a projection operation, and the target frame is reconstructed by an image sampling module; finally, supervision information is extracted from the pixel relation between the reconstructed target frame and the original target frame to train and supervise the network.
Preferably, using the depth map D_t of the target frame output by the network, the pixel points of the target frame are projected into the camera coordinate system of the target frame to generate a sparse point cloud PC_t, computed as:
PC_t(p_t) = D_t(p_t) K^{-1} p_t   (6)
where p_t denotes the coordinates of any pixel point in the target frame;
the coordinate-system conversion step uses the pose transformation matrix T_{t→s} between the source frame and the target frame output by the pose estimation unit to transform the sparse point cloud PC_t into the source-frame coordinate system, obtaining the point cloud PC_s, computed as:
PC_s = T_{t→s} PC_t = R_{t→s} PC_t + t_{t→s}   (7)
where R_{t→s} and t_{t→s} are the rotation matrix and translation vector, both output by the pose estimation unit;
after receiving the sparse point cloud PC_s in the source-frame coordinate system, the projection module re-projects the point cloud into the pixel coordinate system of the source frame using the camera intrinsics K, obtaining the corresponding point coordinates p'_s, computed as:
p'_s = K PC_s(p_t)   (8)
giving the pixel correspondence between the two frames:
p'_s = K T_{t→s} D_t(p_t) K^{-1} p_t   (9)
where the camera intrinsics K are pre-calibrated constants, and the pose transformation matrix T_{t→s} and the depth map D_t are output by the pose estimation unit and the depth estimation unit, respectively.
Preferably, the loss function comprises: l1 loss, Structural SIMilarity (SSIM) loss, and edge smoothing loss.
Preferably, the L1 loss takes the absolute value of the pixel-level difference between the original target frame and the reconstructed target frame as a loss function, as shown in equation (10):
L_1 = (1/|S|) Σ_{I_t ∈ S} Σ_p | I_t(p) − I'_t(p) |   (10)
where S denotes the sequence of consecutive pictures used in one training step, I_t(p) and I'_t(p) denote the pixel values of a pixel point p in the original target frame and the reconstructed target frame, respectively, and <I_1, …, I_N> denotes the image sequence numbered in time order during one training step, corresponding to the 1st through Nth frames.
Preferably, the structural similarity loss function describes the similarity between an uncompressed original picture and a compressed, distorted picture; it constrains the quality of the reconstructed picture in terms of luminance, contrast and structure, computed as follows:
l(x, y) = (2 u_x u_y + c_1) / (u_x^2 + u_y^2 + c_1)   (11)
c(x, y) = (2 σ_x σ_y + c_2) / (σ_x^2 + σ_y^2 + c_2)   (12)
s(x, y) = (σ_xy + c_3) / (σ_x σ_y + c_3)   (13)
SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ   (14)
where l(x, y) denotes luminance similarity, c(x, y) denotes contrast similarity, and s(x, y) denotes structural similarity; x and y are the two image patches being compared, with parameters α = β = γ = 1; u_x and u_y are the means of x and y, σ_x and σ_y their standard deviations, and σ_xy their covariance; c_1 = (k_1 L)^2 and c_2 = (k_2 L)^2, where L is the dynamic range of the pixel values, k_1 = 0.01, k_2 = 0.03, and c_3 = 0.5 c_2.
The similarity metric loss L_ssim(x, y) between the two frames is then given by equation (15):
L_ssim(x, y) = (1 − SSIM(x, y)) / 2   (15)
preferably, the edge smoothing loss LsmoothAs shown in equation (16):
Figure BDA0002946807530000085
wherein d istFor the depth value of the frame at time t,
Figure BDA0002946807530000086
is dtThe average value of (a) of (b),
Figure BDA0002946807530000087
namely, the average normalized inverse depth information; i istIs the pixel value;
Figure BDA0002946807530000088
and
Figure BDA0002946807530000089
representing the differential operation in the x and y dimensions, respectively.
Preferably, the modulus of S takes 3 or 5.
Preferably, the storage medium includes a fixed memory or a video memory, and a space for storing the features and the pictures in the storage medium is kept fixed.
To solve the above technical problem, according to another aspect of the present invention, there is provided a readable storage medium, wherein:
the readable storage medium stores executable instructions which, when executed by a processor, implement the above feature-sharing-based self-supervised monocular depth estimation method.
To solve the above technical problem, according to still another aspect of the present invention, there is provided a feature-sharing-based self-supervised monocular depth estimation system, comprising:
a memory storing a program for executing the above feature-sharing-based self-supervised monocular depth estimation method; and
a processor that executes the program.
To solve the above technical problem, according to still another aspect of the present invention, there is provided an unmanned vehicle comprising:
an on-board processor that executes the above feature-sharing-based self-supervised monocular depth estimation method.
The beneficial effects of this disclosure:
1. A hybrid convolution module is provided that performs feature encoding quickly, efficiently and accurately;
2. A pose estimation decoder based on feature matching is provided that achieves high-precision pose transformation output between two frames;
3. A feature-sharing depth estimation method is provided that saves a large amount of video memory during network training, accelerates network inference, and enables online learning on low-power devices;
4. Real-time, accurate depth estimation from a monocular camera is realized;
5. The simplified single network structure improves computational efficiency while ensuring system accuracy.
drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure. The above and other objects, features, and advantages of the present disclosure will become more apparent from the detailed description of the embodiments of the present disclosure when taken in conjunction with the accompanying drawings.
FIG. 1 is an overall flow diagram;
FIG. 2 is a schematic diagram of a hybrid convolution module;
FIG. 3 is a feature fusion module based on a spatial attention mechanism;
FIG. 4 is a schematic diagram of the pose estimation network;
fig. 5 is a schematic diagram of target frame reconstruction.
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Accordingly, unless otherwise indicated, features of the various embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising" and variations thereof are used in this specification, they specify the presence of stated features, integers, steps, operations, elements, components and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as terms of approximation and not as terms of degree, and as such are used to account for inherent deviations in measured values, calculated values, and/or provided values that would be recognized by one of ordinary skill in the art.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
It is an object of the present disclosure to provide a new method of self-supervised monocular depth estimation.
To reduce data acquisition cost and support an online training and learning mode, no binocular data are used, i.e., neither real nor synthetic binocular data are needed; monocular depth estimation is performed solely with data acquired by a monocular camera, which may be the camera of a monocular acquisition system or any one of the monocular cameras in a multi-camera acquisition system.
To match the processing capacity of a low-power on-board processor while serving unmanned-driving applications that cannot provide a complex, high-cost sensor system, the method redesigns the network structure and adjusts the training strategy: a new single network structure fuses depth estimation and pose estimation, integrating pose estimation into depth estimation, which simplifies the network structure and reduces the computational overhead of the system at the level of the algorithm itself.
The single network structure solves, in principle, the prior-art problem of a complex network structure in which separate depth estimation and pose estimation networks are used to obtain depth and pose information, respectively. It occupies less video memory and fewer computing resources, gives the system good real-time performance, and is very suitable for low-power on-board processors, while also improving the overall speed and performance of the network.
With this single network structure, a brand-new feature-sharing-based monocular single-source depth estimation network is obtained, comprising a feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised-training unit.
To improve pose estimation accuracy and the real-time processing capability of the system, a monocular camera collects data from a video stream, and real-time pose output for streaming video is achieved based on feature matching; that is, the data processed by the disclosed network are real-time streaming video data rather than static images, which guarantees the timeliness and accuracy of the output information and improves the accuracy of depth estimation. The single network structure processes the streaming video data in real time to determine, quickly and accurately, the depth of objects in front of the unmanned vehicle.
To achieve high-precision depth output, an efficient encoder-decoder module is designed, in which a hybrid-convolution encoder mixes depthwise separable convolution, an SE module and a residual convolution module and combines them with dilated convolution, greatly improving computational efficiency. The pose estimation decoder based on feature matching achieves high-precision pose transformation output between two frames. The feature-sharing depth estimation method saves a large amount of video memory during network training and accelerates network inference.
As shown in fig. 1, the disclosure adopts the architecture of a feature-sharing-based self-supervised monocular depth estimation method with a single network structure: instead of two independent networks for depth estimation and pose estimation, feature sharing is used to integrate the pose estimation unit into the depth estimation unit, so that depth estimation and pose estimation are fused within a single network, simplifying the network structure, speeding up processing, and determining in real time the depth of objects detected in the monocular camera video stream. The single network structure comprises a data acquisition module, a feature encoding module, a decoding module, a loss calculation module and a storage module.
The overall flow of the feature-sharing-based self-supervised monocular depth estimation method comprises the following steps:
the first step is data acquisition: a monocular camera deployed on the unmanned vehicle collects data from the video stream and outputs image frames;
the second step is shared feature encoding: after an image frame is received, the image is preprocessed and a multi-scale feature group is output through the encoder;
the third step is decoding, based on the shared feature encoding: on one hand the depth map is decoded, the multi-scale features are processed by the depth estimation unit, and the depth map at the original resolution is output; on the other hand the pose is decoded, the features of the current frame and the features of the previous frame are matched and decoded by the pose estimation unit, and the pose transformation between the two frames is finally output;
the fourth step is loss calculation: the depth map output by the depth estimation unit is combined with the pose transformation between the two frames output by the pose estimation unit, the target frame is reconstructed through projection, interpolation and related operations, and the training of the network is supervised through the difference between the original target frame and the reconstructed target frame;
the fifth step is storage: the original image of the current frame and the features output in the feature encoding step are stored in a fixed memory or video memory for use by the decoding step and the loss calculation step at the next time instant.
Firstly, data acquisition:
the data acquisition system of the present disclosure is composed of a monocular camera, the resolution is 720P or more, the monocular camera may be a monocular camera in the monocular acquisition system, or any one of the monocular cameras in a multi-view acquisition system including a plurality of monocular cameras. Generally, in the unmanned system, the monocular camera is arranged at the upper edge of a front windshield of an unmanned vehicle, and collects visual data in front of the vehicle in real time, so as to realize data collection of video streaming. For the monocular camera, calibration is needed to be carried out for use in subsequent loss calculation, and the calibration algorithm includes, but is not limited to, a Zhang friend calibration method, and finally distortion coefficients, internal parameters and external parameters relative to the vehicle of the monocular camera are obtained.
In the data acquisition process, sampling is carried out in real time according to a certain frequency in a video stream generated by a monocular camera according to a program to generate an image frame, and the generated image frame data is transmitted in real time for use in subsequent steps or modules. Therefore, the method and the device have the advantages that any real binocular data or any synthesized binocular data are not needed, so that the subsequent detection effect does not depend on the precision of the synthesized data, the precision of network training can be ensured only through the data acquired by the monocular camera in real time, and the accuracy of depth estimation is improved; the method does not need real binocular data to adjust the network, overcomes the defect that the traditional method can only carry out training and learning in an off-line mode, and greatly reduces the cost of system data acquisition.
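A minimal sketch of sampling image frames at a fixed frequency from a camera video stream is shown below, using OpenCV; the device index and sampling rate are illustrative assumptions.

```python
import time
import cv2

def sample_frames(device_index=0, target_hz=10.0):
    """Yield frames from a monocular camera at roughly target_hz."""
    cap = cv2.VideoCapture(device_index)
    period = 1.0 / target_hz
    last = 0.0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        now = time.time()
        if now - last >= period:    # keep only frames at the sampling frequency
            last = now
            yield frame             # BGR image handed to the shared encoder
    cap.release()
```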
Secondly, shared feature encoding:
The hybrid-convolution-based encoder mixes depthwise separable convolution, a Squeeze-and-Excitation (SE) module and a residual convolution module, combined with dilated convolution, which improves the network's computational efficiency and inference speed and reduces the parameter scale of the encoder, so that the network can perform real-time inference on resource-limited embedded devices. The SE module implements weighting of the feature channels.
For feature encoding, a deep neural network from the image classification task is used as the encoder to process the image, and the features at each downsampling stage are stored to generate a multi-scale feature set. The deep neural network includes the ResNet deep residual network series. The present disclosure uses a hybrid convolution module as the basic building block of the feature encoder to help the encoder extract robust visual features quickly and efficiently; a schematic diagram of the hybrid convolution module is shown in fig. 2.
On the basis of the residual network module used in the ResNet series, depthwise separable convolution, a Squeeze-and-Excitation (SE) module and dilated convolution are introduced to form the hybrid convolution module, in which a 1×1 dimension-raising convolution expands the feature dimension space, and a 3×3 channel-wise convolution combined with a 1×1 point-wise convolution forms a depthwise separable convolution that improves the computational efficiency of the features and reduces the number of model parameters; the SE module, consisting of two fully connected layers, follows the attention idea and enhances feature expressiveness by re-evaluating the importance of each feature channel. In addition, introducing dilated convolution into the channel-wise convolution guarantees a sufficiently large receptive field during feature extraction and preserves the features' sensitivity to detail information. Specifically:
1) The depthwise separable convolution comes from the MobileNet deep neural network; this network module was designed for embedded devices such as on-board units and mobile phones and is commonly used in lightweight networks.
The depthwise separable convolution decomposes a standard convolution into two steps:
first, a channel-wise convolution is performed, in which each convolution kernel is responsible for only one channel, so the number of kernels equals the number of input feature channels, the number of feature channels stays unchanged, and information between channels remains separated at this point;
second, a point-wise convolution is performed, in which the convolution kernel is reduced to 1×1 while spanning all input feature channels, which mixes the channel information of the features and yields enhanced features.
For a conventional convolution, let the number of channels of the input feature map be C_in, the number of output channels be C_out, the convolution kernel size be K_c, and the width and height of the output feature map be W_out and H_out. The computation Cal(conv) and parameter count Parm(conv) are then:
Cal(conv) = K_c × K_c × C_in × C_out × W_out × H_out   (1)
Parm(conv) = K_c × K_c × C_in × C_out   (2)
Correspondingly, the computation Cal(DW) and parameter count Parm(DW) of the depthwise separable convolution are:
Cal(DW) = K_c × K_c × C_in × W_out × H_out + C_in × C_out × W_out × H_out   (3)
Parm(DW) = K_c × K_c × C_in + C_in × C_out   (4)
The parameter count and computation of the depthwise separable convolution are therefore 1/C_out + 1/K_c² times those of the conventional convolution.
It can be seen that adopting the depthwise separable convolution effectively reduces the parameter count and computation of conventional convolution and lowers the system's computational overhead.
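A minimal sketch contrasting a standard convolution with a depthwise separable one, assuming PyTorch; the channel counts and kernel size are illustrative.

```python
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1)

depthwise_separable = nn.Sequential(
    # channel-wise convolution: one kernel per input channel (groups=c_in)
    nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in),
    # point-wise 1x1 convolution: mixes the channel information
    nn.Conv2d(c_in, c_out, kernel_size=1),
)

def count_params(m):
    return sum(p.numel() for p in m.parameters())

# Parameter counts follow equations (2) and (4); their ratio is roughly
# 1/c_out + 1/k**2, matching the reduction discussed above.
print(count_params(standard), count_params(depthwise_separable))
```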
2) The SE module comes from SE-Net (Squeeze-and-Excitation networks) and draws on the attention mechanism. In the present disclosure, the SE module is used to learn the correlations among the channels of the feature map, evaluate and score each channel, and selectively screen the feature channels. The SE module proceeds in two steps: first, the feature map is compressed, applying global average pooling to the C×W×H feature map to obtain a 1×1×C feature map with a global receptive field; second, feature excitation is performed, in which two fully connected operations carry out a global information interaction across the channel dimension of the 1×1×C feature map, a Sigmoid activation function computes a score for each channel, and the scores are finally multiplied with the original features to obtain the channel-weighted feature map.
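A minimal Squeeze-and-Excitation block sketch in PyTorch; the reduction ratio of 16 is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: C x W x H -> C x 1 x 1
        self.fc = nn.Sequential(                     # excitation: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel score in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # re-weight the information channels
```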
3) To address the loss of detail caused by downsampling, the proposed encoder introduces dilated convolution, reducing the loss of information in the spatial dimension while maintaining a large receptive field.
The specific operation is as follows: dilated convolution is introduced into the fifth layer (layer5) and the sixth layer (layer6) of the network, and a low dilation rate (2/4) is selected so that the backbone network can extract features with a large receptive field and high resolution; to suppress the gridding effect, the max-pooling layer is removed, two layers of ordinary convolution (the seventh layer, layer7, and the eighth layer, layer8) are added after the sixth layer (layer6) with dilation rates of 2 and 1, respectively, and their skip connections are removed to obtain a smoother network output. This improves the expressive power of the features.
In summary, the network structure of the hybrid convolutional encoder proposed in this disclosure is shown in Table 1:
TABLE 1. Encoder network
(The table is provided as an image in the original publication and its contents are not reproduced here.)
Specifically, conv2d denotes a conventional two-dimensional convolution, and mix_conv denotes the hybrid convolution proposed by this method. For the multi-scale feature output, the features output by the second layer (layer2), the third layer (layer3), the fourth layer (layer4), the fifth layer (layer5) and the eighth layer (layer8) are selected to form a multi-scale feature map set representing features at different scales, which assists the decoding step in feature decoding.
Therefore, by adopting the hybrid convolution module, feature encoding can be performed quickly, efficiently and precisely to obtain the multi-scale shared feature group, providing the basis for integrating the pose estimation unit into the depth estimation unit.
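A minimal sketch of a hybrid convolution block in the spirit of fig. 2, combining a 1×1 dimension-raising convolution, an optionally dilated channel-wise convolution, a squeeze-and-excitation stage (here implemented with 1×1 convolutions, equivalent to the two fully connected layers), a 1×1 point-wise projection and a residual connection; the expansion factor, dilation rate and reduction ratio are illustrative assumptions, not the exact configuration of Table 1.

```python
import torch.nn as nn

class HybridConvBlock(nn.Module):
    """Inverted-residual-style block: 1x1 expand, dilated depthwise 3x3,
    squeeze-and-excitation, 1x1 project, residual connection."""
    def __init__(self, channels, expansion=4, dilation=1, se_reduction=16):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=dilation, dilation=dilation, groups=hidden),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.se = nn.Sequential(                     # squeeze-and-excitation on hidden channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden, hidden // se_reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden // se_reduction, hidden, 1), nn.Sigmoid())
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, x):
        h = self.depthwise(self.expand(x))
        h = h * self.se(h)                           # channel re-weighting
        return x + self.project(h)                   # residual connection
```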
Third, decoding
Decoding is divided into two parts, corresponding respectively to the depth estimation unit, which outputs the depth map, and the pose estimation unit, which outputs the pose transformation.
1) The depth estimation unit proposed in the present disclosure works with the multi-scale feature map set: in the decoding process, as the feature resolution is gradually increased, the decoder features are fused with the encoder features at the corresponding scale to recover the detail information at that scale. The present disclosure adopts a fusion module based on a spatial attention mechanism to perform the feature fusion operation; its network structure is shown in fig. 3.
Let the encoder-side feature be f_E and the decoder-side feature be f_D. Each is passed through a 1×1 convolution for dimensionality reduction, giving the compressed features f_h_E and f_h_D, respectively; the feature f_h_D and the feature f_h_E are then concatenated, and after activation the features are compressed to one channel by a 3×3 convolution followed by a Sigmoid function that outputs a weight map σ, which represents the screening of encoder-side information conditioned on the decoder-side information. Finally, the weight map σ is multiplied point-wise with the original encoder feature f_E, and the result is concatenated with the decoder feature f_D for the subsequent depth decoding work.
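A minimal sketch of this spatial-attention fusion in PyTorch; the hidden channel count is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    def __init__(self, enc_channels, dec_channels, hidden=64):
        super().__init__()
        self.reduce_e = nn.Conv2d(enc_channels, hidden, 1)   # 1x1 reduction of f_E -> f_h_E
        self.reduce_d = nn.Conv2d(dec_channels, hidden, 1)   # 1x1 reduction of f_D -> f_h_D
        self.attn = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * hidden, 1, 3, padding=1),          # compress concatenation to 1 channel
            nn.Sigmoid(),                                    # weight map sigma
        )

    def forward(self, f_e, f_d):
        sigma = self.attn(torch.cat([self.reduce_d(f_d), self.reduce_e(f_e)], dim=1))
        # screened encoder feature concatenated with the decoder feature
        return torch.cat([f_e * sigma, f_d], dim=1)
```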
2) The pose estimation unit proposed in the present disclosure performs feature matching by correlation calculation. The specific flow is as follows: the correlation calculation receives the feature maps f_1 and f_2 of the two frames and, in a convolution-like operation, computes the correlation over feature blocks of size (2k+1)×(2k+1) centered on any pair of positions x_1 in f_1 and x_2 in f_2. To reduce computation, the similarity of a given feature block in f_1 is not computed against all feature blocks in f_2, but only against the feature blocks of f_2 within a displacement range of d in the up, down, left and right directions around the corresponding position. The calculation formula is:
c(x_1, x_2) = Σ_{o ∈ [-k,k]×[-k,k]} ⟨ f_1(x_1 + o), f_2(x_2 + o) ⟩   (5)
where f_1(·) and f_2(·) denote the input feature maps, x_1 and x_2 denote the computation centers, k denotes the computation range, ⟨·⟩ denotes the dot-product operation, o denotes the displacement step within the local region, and c(x_1, x_2) denotes the final value of the dot-product operation over the feature-map patches centered on x_1 and x_2. Note that the calculation direction of equation (5) is unidirectional and does not satisfy the commutative law, i.e., c(x_1, x_2) ≠ c(x_2, x_1), which guarantees the unidirectionality of the pose transformation in the pose estimation process.
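A minimal sketch of the local correlation in PyTorch, shown for per-position dot products (the k = 0 case of equation (5)), which is a common simplification; max_disp plays the role of the displacement range d.

```python
import torch
import torch.nn.functional as F

def local_correlation(f1, f2, max_disp=4):
    """Correlate f1 with f2 over displacements up to max_disp.
    f1, f2: tensors of shape (B, C, H, W). Returns (B, (2*max_disp+1)**2, H, W)."""
    b, c, h, w = f1.shape
    f2_pad = F.pad(f2, [max_disp] * 4)                 # pad left/right/top/bottom
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            f2_shift = f2_pad[:, :, dy:dy + h, dx:dx + w]
            out.append((f1 * f2_shift).sum(dim=1, keepdim=True))  # dot product per position
    return torch.cat(out, dim=1)                       # correlation volume fed to the pose decoder
```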
In addition, for the correlation result, the present disclosure introduces dense convolution as the decoder to decode the correlation between features; the pose transformation matrix between the two frames is finally output through one layer of dense convolution and three convolution layers.
In summary, the network diagram for the encoding and decoding of the pose estimation is shown in fig. 4.
Here f1 and f2 denote the output features of the two adjacent frames after encoding by the encoding module, f3 denotes the correlation map, and the pose transformation matrix R, T between the two frames is finally output through pose decoding.
For pose estimation decoding, not only must the encoded features of the current frame be received; correlation must also be computed with the encoded features of the previous frame.
Fourth, loss calculation
The loss calculation module adopted by the present disclosure is divided into two parts: the first is the reconstruction of the target frame, in which the correspondence between coordinates in the source frame and the target frame is calculated using the decoded depth map and pose transformation matrix, and the target frame is reconstructed; the second is the computation of the loss function, for which the present disclosure employs the L1 loss, the structural similarity loss and the edge smoothing loss.
1) The source frame reconstructs the target frame using the depth map output by the depth estimation unit and the pose transformation between the source frame and the target frame output by the pose estimation unit: following the image reconstruction algorithm, points in two-dimensional space are projected into three-dimensional space by a back-projection operation, the three-dimensional points are projected into the coordinate space of the adjacent frame using a coordinate-system transformation and a projection operation, and the target frame is reconstructed by the image sampling module; finally, supervision information is extracted from the pixel relation between the reconstructed target frame and the original target frame to complete the training supervision of the network. The flow is shown in fig. 5.
The back projection part in fig. 5 refers to a depth map D of the target frame output by the networktAnd projecting the pixel points of the target frame into a camera coordinate system under the target frame to generate a sparse point cloud PC (personal computer)tThe calculation formula is as follows:
PCt(pt)=Dt(pt)K-1pt (6)
wherein p istRepresenting the coordinates of any pixel point in the target frame.
The coordinate system conversion step utilizes a pose transformation matrix T of the source frame and the target frame output by the pose estimation unitt→sPC (personal computer) for sparse point cloudtTransforming to the source frame coordinate system to obtain point cloud PCsThe calculation formula is as follows:
PCs=Tt→sPCt=Rt→sPCt+tt→s (7)
wherein R ist→sAnd tt→sThe rotation matrix and the translation vector are both output by the attitude estimation unit.
The projection module receives a sparse point cloud PC under a source frame coordinate systemsAnd then, utilizing the internal parameter K to re-project the point cloud to a pixel coordinate system of the source frame. Obtaining corresponding point coordinates p'sThe calculation formula is as follows:
p′s=KPCs(pt)(8)
in summary, there are two pixel mapping relations between frames:
p′s=KTt→sDt(pt)K-1pt (9)
wherein the camera internal reference K is a pre-calibrated value and a pose transformation matrix Tt→sAnd depth map DtThe obtained result is output by the attitude estimation unit and the depth estimation unit respectively.
Because the projected coordinates p'_s are continuous values, the sampling step uses a fully differentiable bilinear interpolation: the value at each sub-pixel coordinate is interpolated from the pixel values of its four neighbouring coordinate points, which establishes the correspondence between source-frame and target-frame pixels and generates the reconstructed target frame I'_t.
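For illustration only, the following Python/PyTorch sketch chains equations (6) to (9) with differentiable bilinear sampling (here via torch.nn.functional.grid_sample) to reconstruct the target frame from a source frame; the tensor layouts, helper names and the use of grid_sample are assumptions rather than the disclosure's own implementation.

    import torch
    import torch.nn.functional as F

    def reconstruct_target(source_img, depth_t, K, K_inv, T_t_to_s):
        # source_img: (B, 3, H, W) source frame I_s
        # depth_t:    (B, 1, H, W) predicted target-frame depth map D_t
        # K, K_inv:   (B, 3, 3) camera intrinsics and their inverse
        # T_t_to_s:   (B, 4, 4) pose transformation from target to source frame
        b, _, h, w = depth_t.shape
        device = depth_t.device

        # Homogeneous pixel grid p_t, shape (B, 3, H*W)
        ys, xs = torch.meshgrid(torch.arange(h, device=device),
                                torch.arange(w, device=device), indexing="ij")
        ones = torch.ones_like(xs)
        pix = torch.stack([xs, ys, ones], dim=0).float().reshape(1, 3, -1).expand(b, -1, -1)

        # Equation (6): back-project to the target camera coordinate system
        cam_points = depth_t.reshape(b, 1, -1) * (K_inv @ pix)

        # Equation (7): transform the point cloud into the source camera frame
        cam_points_h = torch.cat([cam_points, torch.ones(b, 1, h * w, device=device)], dim=1)
        cam_points_s = (T_t_to_s @ cam_points_h)[:, :3, :]

        # Equations (8)/(9): project with the intrinsics and normalise by depth
        proj = K @ cam_points_s
        px = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)

        # Bilinear sampling: normalise coordinates to [-1, 1] and sample the source frame
        px = px.reshape(b, 2, h, w).permute(0, 2, 3, 1)
        grid = torch.stack([2 * px[..., 0] / (w - 1) - 1,
                            2 * px[..., 1] / (h - 1) - 1], dim=-1)
        return F.grid_sample(source_img, grid, mode="bilinear",
                             padding_mode="border", align_corners=True)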
2) The loss function employed by the present disclosure includes three parts: l1 loss, Structural SIMilarity (SSIM) loss, and edge smoothing loss. The L1 loss refers to L1 distance metric loss commonly used in the field of machine learning, and takes the absolute value of the difference at pixel level between the original target frame and the reconstructed target frame as a loss function, as shown in formula (10):
L_1 = Σ_{s∈S} Σ_p | I_t(p) - I'_t(p) |        (10)
where S represents the sequence of consecutive pictures used in one training step (its size |S| is generally 3 or 5), and I_t(p) and I'_t(p) represent the pixel values at pixel p of the original target frame and the reconstructed target frame, respectively.
<I_1, …, I_N> denotes the sequence of images numbered in temporal order within one training step, corresponding to the 1st through the N-th frame.
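As a minimal sketch (an assumption about the implementation, not taken from the disclosure), the L1 term of equation (10) for one source/target pair could be written in Python/PyTorch as:

    import torch

    def l1_photometric_loss(target: torch.Tensor, reconstructed: torch.Tensor) -> torch.Tensor:
        # Equation (10) for one source/target pair: per-pixel absolute difference,
        # averaged here over all pixels instead of summed.
        return (target - reconstructed).abs().mean()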
The structural similarity (SSIM) metric was originally used to describe the similarity between an uncompressed original picture and its compressed, distorted version, i.e. to measure the performance of a compression algorithm. In the present disclosure, the SSIM loss function constrains the quality of the reconstructed picture in three aspects: luminance, contrast and structure. The calculation formulas are as follows:
l(x, y) = (2·u_x·u_y + c_1) / (u_x^2 + u_y^2 + c_1)        (11)
c(x, y) = (2·σ_x·σ_y + c_2) / (σ_x^2 + σ_y^2 + c_2)        (12)
s(x, y) = (σ_xy + c_3) / (σ_x·σ_y + c_3)        (13)
SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ        (14)
where l(x, y) denotes luminance similarity, c(x, y) contrast similarity and s(x, y) structural similarity; x and y are the two image windows being compared, and the exponents are generally set to α = β = γ = 1. u_x and u_y are the means of x and y, σ_x^2 and σ_y^2 are their variances, and σ_xy is the covariance of x and y. The constants are c_1 = (k_1·L)^2 and c_2 = (k_2·L)^2, where L is the dynamic range of the pixel values, with k_1 = 0.01, k_2 = 0.03 and c_3 = 0.5·c_2. The similarity loss L_ssim(x, y) between the two frames is then given by equation (15):
L_ssim(x, y) = (1 - SSIM(x, y)) / 2        (15)
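A minimal sketch consistent with equations (11) to (15), assuming PyTorch, pixel values normalised to [0, 1] (so L = 1 and c1, c2 reduce to k1^2, k2^2), 3x3 local windows realised with average pooling, and the usual simplification c3 = 0.5·c2 that folds the contrast and structure terms together; these choices are assumptions, not prescribed by the disclosure.

    import torch
    import torch.nn.functional as F

    def ssim_loss(x: torch.Tensor, y: torch.Tensor,
                  c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
        # Local statistics over 3x3 windows (stride 1, padding 1 keeps the spatial size)
        mu_x = F.avg_pool2d(x, 3, 1, padding=1)
        mu_y = F.avg_pool2d(y, 3, 1, padding=1)
        sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
        sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
        sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y

        # SSIM with alpha = beta = gamma = 1 and c3 = c2 / 2 folded in
        ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))

        # Equation (15): (1 - SSIM) / 2, clamped and averaged over all pixels
        return ((1 - ssim) / 2).clamp(0, 1).mean()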
Edge smoothing loss: the depth map output by a network usually shows discontinuity artifacts, i.e. the predicted depth over the surface of an object that actually has uniform depth is not smooth. To make the depth values over object surfaces smoother while keeping the depth values between different objects clearly separated, an edge smoothing loss L_smooth is introduced, as shown in equation (16):
L_smooth = | ∂_x d*_t | · e^{-| ∂_x I_t |} + | ∂_y d*_t | · e^{-| ∂_y I_t |}        (16)
where d_t is the depth value of the frame at time t, mean(d_t) is its mean value, d*_t = d_t / mean(d_t) is the mean-normalized inverse depth, I_t is the pixel value, and ∂_x and ∂_y denote the differential operations in the x and y dimensions, respectively.
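A minimal sketch of equation (16), assuming PyTorch, a disparity (inverse depth) map that is mean-normalised before the gradients are taken, and finite differences standing in for the differential operators ∂_x and ∂_y; the tensor names are illustrative.

    import torch

    def edge_aware_smoothness(disp: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
        # disp: (B, 1, H, W) predicted inverse depth / disparity
        # img:  (B, 3, H, W) corresponding input image I_t
        disp = disp / (disp.mean(dim=[2, 3], keepdim=True) + 1e-7)   # mean normalisation d*_t

        grad_disp_x = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
        grad_disp_y = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()

        grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
        grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)

        # Down-weight depth gradients where the image itself has strong edges
        return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
               (grad_disp_y * torch.exp(-grad_img_y)).mean()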
Fifth, storage
In the encoder-shared depth estimation method of the present disclosure, the pose estimation decoder not only receives the encoded features of the current frame but also performs a correlation calculation with the encoded features of the previous frame. Therefore, after training on the current frame is completed, the encoded features and the original picture of the current frame need to be stored in a storage medium for use when training on the next frame. Storage media include, but are not limited to, system memory and GPU video memory. To improve efficiency, the present disclosure proposes that the space reserved in the storage medium for storing the features and pictures be kept fixed.
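One possible realisation of such a fixed-size store (an assumption for illustration, not the disclosure's specification) is to allocate the buffers for the previous frame's picture and encoded features once and overwrite them in place on every training step:

    from dataclasses import dataclass
    from typing import List, Optional
    import torch

    @dataclass
    class FrameCache:
        # Fixed-size buffer holding the previous frame's image and encoded features.
        # The tensors are allocated once and overwritten in place afterwards, so the
        # memory footprint in RAM/VRAM stays constant across training steps.
        image: Optional[torch.Tensor] = None
        features: Optional[List[torch.Tensor]] = None   # multi-scale shared feature group

        def update(self, image: torch.Tensor, features: List[torch.Tensor]) -> None:
            if self.image is None:
                # first frame: allocate the fixed storage once
                self.image = image.detach().clone()
                self.features = [f.detach().clone() for f in features]
            else:
                # later frames: reuse the same buffers
                self.image.copy_(image.detach())
                for dst, src in zip(self.features, features):
                    dst.copy_(src.detach())

During training, the cached features would be handed to the pose estimation decoder together with the current frame's features before update() overwrites them with the current frame.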
In conclusion, the pose estimation network is redesigned from the viewpoint of feature matching: a correlation calculation module and a dense convolution module are introduced, improving the performance of the pose estimation network, and the feature-matching-based pose decoder achieves high-precision pose transformation output between two frames. Because a feature-sharing design is adopted, the pose estimation unit and the depth estimation unit share the same feature encoder, yielding a single network structure; this reduces network complexity, saves a large amount of video memory during training and speeds up network inference, so that the occupation of video memory and computing resources in both training and forward inference is reduced, the overall speed and performance of the network are improved, and online learning becomes feasible on low-power devices. Because the hybrid-convolution encoder combines depthwise separable convolution, the SE module and residual convolution modules with dilated convolution, it improves the computational efficiency of the network while increasing inference speed and reducing the parameter scale of the encoder, so that feature encoding can be performed quickly, efficiently and accurately, and the network can perform real-time inference on resource-constrained embedded devices.
Therefore, the method of the present disclosure reduces network computation and video memory occupation, improves the output accuracy of the pose estimation network, lowers the network's demand for video memory and computing resources, and improves computational efficiency; the monocular depth estimation task can be completed accurately and in real time on an unmanned vehicle, the streaming video captured by the monocular camera is processed in real time, and the depth of objects in front of the unmanned vehicle is determined quickly and accurately.
Therefore, the novel feature-sharing-based self-supervised monocular depth estimation method and system provided by the present disclosure are well suited to the outdoor operating environment of unmanned vehicles: the computational cost is low, a low-power vehicle-mounted processor is sufficient, and no expensive sensor system is required; the depth estimation is real-time, highly accurate and has broad application prospects.
So far, the technical solutions of the present disclosure have been described in connection with the preferred embodiments shown in the drawings, but it should be understood by those skilled in the art that the above embodiments are only for clearly illustrating the present disclosure, and not for limiting the scope of the present disclosure, and it is apparent that the scope of the present disclosure is not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the disclosure, and the technical scheme after the changes or substitutions will fall into the protection scope of the disclosure.

Claims (10)

1. A self-supervision type monocular depth estimation method based on feature sharing is characterized in that,
a single network structure is adopted and the pose estimation unit is integrated into the depth estimation unit, so that the depth estimation and pose estimation operations are unified and a monocular single-source depth estimation network based on feature sharing is obtained; the monocular single-source depth estimation network comprises: a shared feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised training unit;
the method comprises the following steps:
step one, acquiring data, namely acquiring data from a video stream through a monocular camera and outputting image frames;
step two, sharing feature coding, namely, after receiving the image frame, preprocessing the image frame, and outputting a multi-scale sharing feature group through an encoder;
step three, decoding, namely receiving the multi-scale shared feature group, processing the multi-scale features through the depth estimation unit, and outputting a depth map at the original resolution; and performing feature matching and decoding on the features of the current frame and the features of the previous frame through the pose estimation unit, and outputting the pose transformation between the two frames;
and step four, loss calculation, namely combining the depth map output by the depth estimation unit with the pose transformation between the two frames output by the pose estimation unit to reconstruct the target frame, and supervising the training of the network through the difference between the original target frame and the reconstructed target frame.
2. The feature sharing based self-supervised monocular depth estimation method of claim 1,
the monocular camera is deployed on an unmanned vehicle.
3. The feature sharing based self-supervised monocular depth estimation method of claim 1 or 2,
the monocular camera is deployed at the upper edge of a front windshield of the unmanned vehicle.
4. The feature sharing based self-supervised monocular depth estimation method of any one of claims 1-3,
and reconstructing a target frame through projection and interpolation operations.
5. The feature sharing based self-supervised monocular depth estimation method of any one of claims 1-3, further comprising the steps of:
and step five, storing, namely storing the original image of the frame and the features output in the feature encoding step in a storage medium for the operation of the decoding step and the loss calculating step at the next moment.
6. The feature sharing based self-supervised monocular depth estimation method of claim 1 or 2,
the monocular camera resolution is 720P or more;
the monocular camera is a monocular camera in a monocular acquisition system, or any one of monocular cameras in a multi-view acquisition system comprising a plurality of monocular cameras.
7. The feature sharing based self-supervised monocular depth estimation method of claim 1 or 2,
in the first step, in a video stream generated by the monocular camera, sampling is carried out in real time according to a certain frequency to generate an image frame.
8. A readable storage medium, wherein
the readable storage medium has stored therein execution instructions for implementing the feature sharing based self-supervised monocular depth estimation method of any one of the preceding claims when executed by a processor.
9. A self-supervised monocular depth estimation system based on feature sharing, comprising:
a memory storing a program for performing the feature sharing based self-supervised monocular depth estimation method of any one of claims 1-8; and
a processor, wherein the processor executes the program.
10. An unmanned vehicle, comprising:
an on-board processor performing the feature sharing based self-supervised monocular depth estimation method of any one of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110196301.6A CN113034563A (en) 2021-02-22 2021-02-22 Self-supervision type monocular depth estimation method based on feature sharing

Publications (1)

Publication Number Publication Date
CN113034563A true CN113034563A (en) 2021-06-25

Family

ID=76460975

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination