CN113034563A - Self-supervised monocular depth estimation method based on feature sharing


Info

Publication number
CN113034563A
CN113034563A (application CN202110196301.6A)
Authority
CN
China
Prior art keywords
feature
monocular
depth estimation
depth
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110196301.6A
Other languages
Chinese (zh)
Inventor
杨明
李雪
范圣印
陈禹行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yihang Yuanzhi Intelligent Technology Co Ltd
Original Assignee
Suzhou Yihang Yuanzhi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Yihang Yuanzhi Intelligent Technology Co Ltd filed Critical Suzhou Yihang Yuanzhi Intelligent Technology Co Ltd
Priority to CN202110196301.6A
Publication of CN113034563A
Legal status: Pending

Classifications

    • G06T7/50 Image analysis; Depth or shape recovery
    • G06F18/25 Pattern recognition; Fusion techniques
    • G06N3/045 Neural networks; Combinations of networks
    • G06N3/08 Neural networks; Learning methods
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20221 Image fusion; Image merging
    • G06T2207/30244 Camera pose

Abstract

A feature-sharing-based self-supervised monocular depth estimation method adopts a new single-network structure that integrates the pose estimation module into the depth estimation module, unifying the two functions of depth estimation and pose estimation and yielding a feature-sharing-based monocular depth estimation network. The network comprises: a feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised-training unit. The pose estimation unit produces real-time pose output for streaming video based on feature matching, improving the accuracy of pose estimation; the depth estimation unit is built on an efficient encoder-decoder module, improving computational efficiency; the output of the depth estimation unit and the output of the pose estimation unit are combined, and supervision information is extracted from the original pictures, completing the self-supervised network training process. The method effectively solves the problem of acquiring high-precision monocular depth information in real time for unmanned driving.

Description

Self-supervised monocular depth estimation method based on feature sharing
Technical Field
The disclosure relates to the technical field of depth perception and computer vision in the unmanned-driving industry, in particular to a feature-sharing-based self-supervised monocular depth estimation method for unmanned-driving scenes, and more particularly to such a method implemented with a single network structure.
Background
With the continuous development of computer vision technology, three-dimensional scene perception plays a crucial role in the unmanned-driving industry. Unlike two-dimensional perception, three-dimensional perception detects pedestrians, vehicles, obstacles and other information around the unmanned vehicle in real three-dimensional space. The information acquired by three-dimensional perception is a key basis for the vehicle's motion decisions, and depth information is the foundation of the three-dimensional scene perception task. Given the constraints of the actual sensor and camera layout on unmanned vehicles and the need to accurately detect pedestrians, vehicles, obstacles and other information around the vehicle in dynamically changing situations, the three-dimensional perception system must be able to recover scene depth from a single camera, whether that camera is the only one in a monocular acquisition system or any one camera in a multi-camera acquisition system; that is, it must solve the problem of acquiring depth information of the scene from one monocular camera alone. Since such situations arise routinely during the dynamic operation of an unmanned vehicle, acquiring depth information with a monocular camera is a key research topic in the three-dimensional scene perception task.
At present, for the monocular depth estimation task, existing research can be divided into supervised and unsupervised directions. Supervised monocular depth estimation networks generally perform well on specific datasets, but suffer from large numbers of model parameters and the difficulty of obtaining labeled data; unsupervised monocular depth estimation networks can be applied flexibly to different datasets, but suffer from limited model accuracy and poorly designed training strategies.
To assess the state of the art, existing patents and papers were searched, compared and analyzed:
Patent document CN 108961327 A, "Monocular depth estimation method and apparatus, device, and storage medium", proposes training a monocular depth estimation network with binocular information, obtaining the network by training with synthetic and real binocular sample data, respectively. The method uses a small amount of real binocular data and a large amount of synthetic binocular data, so its effect depends heavily on the quality of the synthetic data; moreover, it requires real binocular data to fine-tune the network, so training and learning can only be performed offline, which increases the cost of data acquisition.
Patent document CN 111680554 A, "Depth estimation method and apparatus for automatic driving scene, and autonomous vehicle", proposes a cascade network to optimize and complement the result of monocular depth estimation: a depth estimation model first generates basic depth estimation information, and a deviation estimation network then outputs deviation estimates for the target region, addressing insufficient depth estimation accuracy in that region. To improve the accuracy of monocular depth estimation on the target region, the method introduces an object detection method and a cascaded network, which enlarges the overall network, complicates the structure, raises system cost and computational consumption, and makes it difficult for the neural network method to run in real time.
Patent document CN 110599533 A, "Fast monocular depth estimation method applicable to embedded platform", proposes a lightweight monocular depth estimation method in which a lightweight depth estimation network is deployed on an embedded platform and a model training framework is configured on an edge server, the two interacting over the network: the embedded platform provides data and labels to the edge server, and the edge server trains the model and updates the one deployed on the embedded platform. The method addresses deployment of a depth estimation network on an embedded platform, but it uses an RGB-D camera to collect monocular pictures and depth maps; limited by the RGB-D camera, its depth-sensing range is restricted to indoor scenes, it is generally applied to indoor robot motion, and it is not suitable for outdoor vehicle-side applications.
It can be seen that, for unmanned driving, existing monocular depth estimation methods, whether in network model accuracy or training strategy, cannot meet the requirements for detection precision, stability and real-time performance, and it is difficult to obtain a satisfactory overall result on a low-power on-board processor. A new monocular depth estimation method is therefore needed that preserves monocular depth estimation accuracy, meets the needs of outdoor unmanned vehicles, adds no extra computational overhead, can run on a low-power on-board processor, and requires no complex, high-cost sensor system.
Disclosure of Invention
To suit unmanned-driving applications and perform real-time, efficient, accurate and reliable depth estimation on video images acquired by a monocular camera, so as to effectively obtain depth information for three-dimensional scene perception, the disclosure provides a new self-supervised monocular depth estimation method. The method adopts a single network structure: instead of two independent networks for depth estimation and pose estimation, feature sharing is used to integrate the pose estimation unit into the depth estimation unit, so that depth estimation and pose estimation are fused within a single network, yielding a brand-new feature-sharing-based monocular single-source depth estimation network. This simplifies the network structure, speeds up network processing, and determines in real time the depth of objects detected in the monocular camera video stream.
The monocular single-source depth estimation network comprises a feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised-training unit.
The pose estimation unit produces real-time pose output for streaming video based on feature matching, improving the accuracy of pose estimation.
The depth estimation unit is built on an efficient encoder-decoder module, in which a hybrid-convolution encoder mixes depthwise separable convolution, an SE module and a residual convolution module, combined with dilated convolution, improving computational efficiency and achieving high-precision depth output.
The output of the depth estimation unit and the output of the pose estimation unit are combined, supervision information is derived from the original pictures, and the self-supervised network training process is realized using the current-frame features and the image frames.
A feature-sharing-based self-supervised monocular depth estimation method is thereby obtained.
The method reduces network computation and video-memory usage, improves the output accuracy of pose estimation, reduces the network's demands on video memory and computing resources, and improves computational efficiency; efficient feature encoding and decoding modules are designed so that the output accuracy of the network is improved while computation and parameter counts are reduced, ensuring the real-time performance of the depth estimation task.
Specifically, to solve the above technical problem, according to one aspect of the present invention, there is provided a feature-sharing-based self-supervised monocular depth estimation method, wherein:
a single network structure is adopted, the pose estimation unit is integrated into the depth estimation unit, the operations of depth estimation and pose estimation are fused, and a feature-sharing-based monocular single-source depth estimation network is obtained; the monocular single-source depth estimation network comprises: a shared feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised-training unit;
the method comprises the following steps:
step one, data acquisition: acquiring data from the video stream of a monocular camera and outputting image frames;
step two, shared feature encoding: after an image frame is received, it is preprocessed and a multi-scale shared feature group is output through the encoder;
step three, decoding: receiving the multi-scale shared feature group, processing the multi-scale features through the depth estimation unit, and outputting a depth map at the original resolution; performing feature matching and decoding on the current-frame features and the previous-frame features through the pose estimation unit, and outputting the pose transformation between the two frames;
step four, loss calculation: combining the depth map output by the depth estimation unit with the pose transformation between the two frames output by the pose estimation unit, reconstructing the target frame, and supervising the training of the network through the difference between the original target frame and the reconstructed target frame; a sketch of one training iteration is given after these steps.
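A minimal sketch of one such training iteration, assuming a PyTorch-style implementation, is given below. The module and helper names (encoder, depth_decoder, pose_decoder, warp_to_target, loss_fn, cache) are illustrative assumptions, not the patent's reference implementation.

```python
import torch

def training_step(frame_t, encoder, depth_decoder, pose_decoder,
                  warp_to_target, loss_fn, cache, optimizer, K):
    """One self-supervised iteration: shared encoding, depth and pose decoding,
    reconstruction-based loss. All names here are illustrative assumptions."""
    feats_t = encoder(frame_t)            # step two: multi-scale shared feature group
    depth_t = depth_decoder(feats_t)      # step three (a): depth map at original resolution

    if cache.get("feats") is not None:
        # step three (b): pose from matching current and previous frame features
        T_t_to_s = pose_decoder(feats_t[-1], cache["feats"][-1])

        # step four: reconstruct the target frame and supervise with the difference
        frame_rec = warp_to_target(cache["frame"], depth_t, T_t_to_s, K)
        loss = loss_fn(frame_t, frame_rec, depth_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # step five: keep this frame's features and image for the next iteration
    cache["feats"] = [f.detach() for f in feats_t]
    cache["frame"] = frame_t.detach()
```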
Preferably, the monocular camera is deployed on an unmanned vehicle.
Preferably, the monocular camera is disposed at an upper edge of a front windshield of the unmanned vehicle.
Preferably, the target frame is reconstructed by projection and interpolation operations.
Preferably, the method further comprises the following steps:
and step five, storage: storing the original image of the current frame and the features output in the feature encoding step in a storage medium, for use by the decoding step and the loss calculation step at the next time instant.
Preferably, the monocular camera resolution is 720P or greater;
the monocular camera is a monocular camera in a monocular acquisition system, or any one of monocular cameras in a multi-view acquisition system comprising a plurality of monocular cameras.
Preferably, in step one, image frames are generated by sampling the video stream produced by the monocular camera in real time at a certain frequency.
Preferably, in the shared feature encoding, a hybrid-convolution encoder is employed that mixes depthwise separable convolution and an SE module (i.e., the Squeeze-and-Excitation module) with a residual convolution module, combined with dilated convolution.
Preferably, the image is processed by a deep neural network, and the features at each downsampling stage are stored to generate a multi-scale feature set.
Preferably, the deep neural network comprises a deep residual network (ResNet) series network.
Preferably, in the hybrid convolution, a 1×1 dimension-raising convolution is used to expand the feature dimension space, and a 3×3 channel-wise convolution combined with a 1×1 point-wise convolution forms a depthwise separable convolution, improving the computational efficiency of the features and reducing the number of model parameters; the SE module is formed by two fully connected layers and enhances feature expressiveness by re-evaluating the importance of each feature channel; a dilated convolution is introduced in the channel-wise convolution to preserve the receptive field during feature extraction.
Preferably, the depthwise separable convolution decomposes a standard convolution into two steps:
first, a channel-wise convolution is performed, in which each convolution kernel is responsible for only one channel;
second, a point-wise convolution is performed, in which the convolution kernel is reduced to 1×1 and spans all input feature channels, mixing the channel information of the features to obtain enhanced features;
the computation Cal(DW) and the parameter count Parm(DW) of the depthwise separable convolution are respectively:
Cal(DW) = K_c × K_c × C_in × W_out × H_out + C_in × C_out × W_out × H_out   (3)
Parm(DW) = K_c × K_c × C_in + C_in × C_out   (4)
where C_in is the number of channels of the input feature map, C_out is the number of channels of the output feature map, K_c is the size of the convolution kernel, and W_out and H_out are the width and height of the output feature map, respectively.
Preferably, the SE module is used to learn the correlations among the channels of the feature map, evaluate and score each channel, and then selectively screen the feature channels.
Preferably, the operation of the SE module comprises:
first, compressing the feature map: global average pooling is applied to the C×W×H feature map to obtain a 1×1×C feature map with a global receptive field;
second, feature excitation: two fully connected operations perform a global information interaction across the channel dimension of the 1×1×C feature map, a Sigmoid activation function computes a score for each channel, and the scores are finally multiplied with the original features to obtain the channel-weighted feature map.
Preferably, dilated convolution is introduced into the fifth layer (layer5) and the sixth layer (layer6) of the network, and a low dilation rate is selected so that the backbone network extracts features with a large receptive field and high resolution;
the max-pooling layer is removed, two layers of ordinary convolution, the seventh layer (layer7) and the eighth layer (layer8), are added after the sixth layer (layer6) with dilation rates of 2 and 1, respectively, and their skip connections are removed to obtain a smoother network output.
Preferably, for the multi-scale feature output, the features output by the second layer (layer2), the third layer (layer3), the fourth layer (layer4), the fifth layer (layer5) and the eighth layer (layer8) are selected to form a multi-scale feature map set representing features at different scales, which is used for feature decoding in the decoding step.
Preferably, the depth estimation unit performs feature fusion using a fusion module based on a spatial attention mechanism. Let f_E be the encoder-side feature and f_D the decoder-side feature; each of f_E and f_D is passed through a 1×1 convolution for dimensionality reduction, giving the compressed features f_h_E and f_h_D, respectively. The feature f_h_D and the feature f_h_E are concatenated, and after activation the features are compressed to one channel by a 3×3 convolution followed by a Sigmoid function, which outputs a weight map σ representing the screening of encoder-side information conditioned on the decoder-side information; the weight map σ is then multiplied point-wise with the original encoder feature f_E, and the result is finally concatenated with the decoder feature f_D for deep decoding.
Preferably, the pose estimation unit performs feature matching by correlation calculation, which receives the feature maps f_1 and f_2 of two frames separately and, for any pair of positions x_1 in f_1 and x_2 in f_2, performs the correlation calculation over feature blocks of size (2k+1)×(2k+1) centered on them; for a given feature block in f_1, the similarity is not computed against all feature blocks in f_2, but only against the feature blocks of f_2 within a displacement range of d in the up, down, left and right directions around the corresponding position. The calculation is shown in equation (5):
c(x_1, x_2) = Σ_{o ∈ [-k,k]×[-k,k]} ⟨ f_1(x_1 + o), f_2(x_2 + o) ⟩   (5)
where f_1(·) and f_2(·) denote the input feature maps, x_1 and x_2 denote the computation centers, k denotes the computation range, ⟨·⟩ denotes the dot-product operation, o denotes the displacement step within the local region, and c(x_1, x_2) denotes the final value of the dot-product operation over the feature-map patches centered on x_1 and x_2; the calculation direction of equation (5) is unidirectional and does not satisfy the commutative law, i.e., c(x_1, x_2) ≠ c(x_2, x_1), which guarantees the unidirectionality of the pose transformation in the pose estimation process.
Preferably, dense convolution is adopted as the decoder to decode the correlation between features, and the pose transformation matrix between the two frames is finally output through one layer of dense convolution and three convolution layers.
Preferably, the loss calculation includes:
reconstruction of the target frame: the correspondence between coordinates in the source frame and the target frame is calculated using the decoded depth map and pose transformation matrix, and the target frame is reconstructed accordingly;
calculation of the loss function: the L1 loss, the structural similarity loss and the edge smoothing loss are used as loss functions.
Preferably, the source frame reconstructs the target frame using the depth map output by the depth estimation unit and the pose transformation between the source frame and the target frame output by the pose estimation unit: points in two-dimensional space are projected into three-dimensional space by a back-projection operation, the three-dimensional points are projected into the coordinate space of the adjacent frame using a coordinate-system transformation and a projection operation, and the target frame is reconstructed by an image sampling module; finally, supervision information is extracted from the pixel relation between the reconstructed target frame and the original target frame to train and supervise the network.
Preferably, using the depth map D_t of the target frame output by the network, the pixel points of the target frame are projected into the camera coordinate system of the target frame to generate a sparse point cloud PC_t, computed as:
PC_t(p_t) = D_t(p_t) K^{-1} p_t   (6)
where p_t denotes the coordinates of any pixel point in the target frame;
the coordinate-system conversion step uses the pose transformation matrix T_{t→s} between the source frame and the target frame output by the pose estimation unit to transform the sparse point cloud PC_t into the source-frame coordinate system, obtaining the point cloud PC_s, computed as:
PC_s = T_{t→s} PC_t = R_{t→s} PC_t + t_{t→s}   (7)
where R_{t→s} and t_{t→s} are the rotation matrix and translation vector, both output by the pose estimation unit;
after receiving the sparse point cloud PC_s in the source-frame coordinate system, the projection module re-projects the point cloud into the pixel coordinate system of the source frame using the camera intrinsics K, obtaining the corresponding point coordinates p'_s, computed as:
p'_s = K PC_s(p_t)   (8)
giving the pixel correspondence between the two frames:
p'_s = K T_{t→s} D_t(p_t) K^{-1} p_t   (9)
where the camera intrinsics K are pre-calibrated constants, and the pose transformation matrix T_{t→s} and the depth map D_t are output by the pose estimation unit and the depth estimation unit, respectively.
Preferably, the loss function comprises: l1 loss, Structural SIMilarity (SSIM) loss, and edge smoothing loss.
Preferably, the L1 loss takes the absolute value of the pixel-level difference between the original target frame and the reconstructed target frame as a loss function, as shown in equation (10):
L_1 = (1/|S|) Σ_{I_t ∈ S} Σ_p | I_t(p) − I'_t(p) |   (10)
where S denotes the sequence of consecutive pictures used in one training step, I_t(p) and I'_t(p) denote the pixel values of a pixel point p in the original target frame and the reconstructed target frame, respectively, and <I_1, …, I_N> denotes the image sequence numbered in time order during one training step, corresponding to the 1st through Nth frames.
Preferably, the structural similarity loss function describes the similarity between an uncompressed original picture and a compressed, distorted picture; it constrains the quality of the reconstructed picture in terms of luminance, contrast and structure, computed as follows:
l(x, y) = (2 u_x u_y + c_1) / (u_x^2 + u_y^2 + c_1)   (11)
c(x, y) = (2 σ_x σ_y + c_2) / (σ_x^2 + σ_y^2 + c_2)   (12)
s(x, y) = (σ_xy + c_3) / (σ_x σ_y + c_3)   (13)
SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ   (14)
where l(x, y) denotes luminance similarity, c(x, y) denotes contrast similarity, and s(x, y) denotes structural similarity; x and y are the two image patches being compared, with parameters α = β = γ = 1; u_x and u_y are the means of x and y, σ_x and σ_y their standard deviations, and σ_xy their covariance; c_1 = (k_1 L)^2 and c_2 = (k_2 L)^2, where L is the dynamic range of the pixel values, k_1 = 0.01, k_2 = 0.03, and c_3 = 0.5 c_2.
The similarity metric loss L_ssim(x, y) between the two frames is then given by equation (15):
L_ssim(x, y) = (1 − SSIM(x, y)) / 2   (15)
preferably, the edge smoothing loss LsmoothAs shown in equation (16):
Figure BDA0002946807530000085
wherein d istFor the depth value of the frame at time t,
Figure BDA0002946807530000086
is dtThe average value of (a) of (b),
Figure BDA0002946807530000087
namely, the average normalized inverse depth information; i istIs the pixel value;
Figure BDA0002946807530000088
and
Figure BDA0002946807530000089
representing the differential operation in the x and y dimensions, respectively.
Preferably, the modulus of S takes 3 or 5.
Preferably, the storage medium includes a fixed memory or a video memory, and a space for storing the features and the pictures in the storage medium is kept fixed.
To solve the above technical problem, according to another aspect of the present invention, there is provided a readable storage medium, wherein:
the readable storage medium stores executable instructions which, when executed by a processor, implement the above feature-sharing-based self-supervised monocular depth estimation method.
To solve the above technical problem, according to still another aspect of the present invention, there is provided a feature-sharing-based self-supervised monocular depth estimation system, comprising:
a memory storing a program for executing the above feature-sharing-based self-supervised monocular depth estimation method; and
a processor that executes the program.
To solve the above technical problem, according to still another aspect of the present invention, there is provided an unmanned vehicle comprising:
an on-board processor that executes the above feature-sharing-based self-supervised monocular depth estimation method.
The beneficial effects of this disclosure:
1. A hybrid convolution module is provided that performs feature encoding quickly, efficiently and accurately;
2. A pose estimation decoder based on feature matching is provided that achieves high-precision pose transformation output between two frames;
3. A feature-sharing depth estimation method is provided that saves a large amount of video memory during network training, accelerates network inference, and enables online learning on low-power devices;
4. Real-time, accurate depth estimation from a monocular camera is realized;
5. The simplified single network structure improves computational efficiency while ensuring system accuracy.
drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure. The above and other objects, features, and advantages of the present disclosure will become more apparent from the detailed description of the embodiments of the present disclosure when taken in conjunction with the accompanying drawings.
FIG. 1 is an overall flow diagram;
FIG. 2 is a schematic diagram of a hybrid convolution module;
FIG. 3 is a feature fusion module based on a spatial attention mechanism;
FIG. 4 is a schematic diagram of the pose estimation network;
fig. 5 is a schematic diagram of target frame reconstruction.
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Accordingly, unless otherwise indicated, features of the various embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, when the terms "comprises" and/or "comprising" and variations thereof are used in this specification, they specify the presence of stated features, integers, steps, operations, elements, components and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as terms of approximation and not as terms of degree, and as such are used to account for inherent deviations in measured values, calculated values, and/or provided values that would be recognized by one of ordinary skill in the art.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
It is an object of the present disclosure to provide a new method of self-supervised monocular depth estimation.
To reduce data acquisition cost and support an online training and learning mode, no binocular data are used, i.e., neither real nor synthetic binocular data are needed; monocular depth estimation is performed solely with data acquired by a monocular camera, which may be the camera of a monocular acquisition system or any one of the monocular cameras in a multi-camera acquisition system.
To match the processing capacity of a low-power on-board processor while serving unmanned-driving applications that cannot provide a complex, high-cost sensor system, the method redesigns the network structure and adjusts the training strategy: a new single network structure fuses depth estimation and pose estimation, integrating pose estimation into depth estimation, which simplifies the network structure and reduces the computational overhead of the system at the level of the algorithm itself.
The single network structure solves, in principle, the prior-art problem of a complex network structure in which separate depth estimation and pose estimation networks are used to obtain depth and pose information, respectively. It occupies less video memory and fewer computing resources, gives the system good real-time performance, and is very suitable for low-power on-board processors, while also improving the overall speed and performance of the network.
With this single network structure, a brand-new feature-sharing-based monocular single-source depth estimation network is obtained, comprising a feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised-training unit.
To improve pose estimation accuracy and the real-time processing capability of the system, a monocular camera collects data from a video stream, and real-time pose output for streaming video is achieved based on feature matching; that is, the data processed by the disclosed network are real-time streaming video data rather than static images, which guarantees the timeliness and accuracy of the output information and improves the accuracy of depth estimation. The single network structure processes the streaming video data in real time to determine, quickly and accurately, the depth of objects in front of the unmanned vehicle.
To achieve high-precision depth output, an efficient encoder-decoder module is designed, in which a hybrid-convolution encoder mixes depthwise separable convolution, an SE module and a residual convolution module and combines them with dilated convolution, greatly improving computational efficiency. The pose estimation decoder based on feature matching achieves high-precision pose transformation output between two frames. The feature-sharing depth estimation method saves a large amount of video memory during network training and accelerates network inference.
As shown in fig. 1, the disclosure adopts the architecture of a feature-sharing-based self-supervised monocular depth estimation method with a single network structure: instead of two independent networks for depth estimation and pose estimation, feature sharing is used to integrate the pose estimation unit into the depth estimation unit, so that depth estimation and pose estimation are fused within a single network, simplifying the network structure, speeding up processing, and determining in real time the depth of objects detected in the monocular camera video stream. The single network structure comprises a data acquisition module, a feature encoding module, a decoding module, a loss calculation module and a storage module.
The overall flow of the feature-sharing-based self-supervised monocular depth estimation method comprises the following steps:
the first step is data acquisition: a monocular camera deployed on the unmanned vehicle collects data from the video stream and outputs image frames;
the second step is shared feature encoding: after an image frame is received, the image is preprocessed and a multi-scale feature group is output through the encoder;
the third step is decoding, based on the shared feature encoding: on one hand the depth map is decoded, the multi-scale features are processed by the depth estimation unit, and the depth map at the original resolution is output; on the other hand the pose is decoded, the features of the current frame and the features of the previous frame are matched and decoded by the pose estimation unit, and the pose transformation between the two frames is finally output;
the fourth step is loss calculation: the depth map output by the depth estimation unit is combined with the pose transformation between the two frames output by the pose estimation unit, the target frame is reconstructed through projection, interpolation and related operations, and the training of the network is supervised through the difference between the original target frame and the reconstructed target frame;
the fifth step is storage: the original image of the current frame and the features output in the feature encoding step are stored in a fixed memory or video memory for use by the decoding step and the loss calculation step at the next time instant.
Firstly, data acquisition:
the data acquisition system of the present disclosure is composed of a monocular camera, the resolution is 720P or more, the monocular camera may be a monocular camera in the monocular acquisition system, or any one of the monocular cameras in a multi-view acquisition system including a plurality of monocular cameras. Generally, in the unmanned system, the monocular camera is arranged at the upper edge of a front windshield of an unmanned vehicle, and collects visual data in front of the vehicle in real time, so as to realize data collection of video streaming. For the monocular camera, calibration is needed to be carried out for use in subsequent loss calculation, and the calibration algorithm includes, but is not limited to, a Zhang friend calibration method, and finally distortion coefficients, internal parameters and external parameters relative to the vehicle of the monocular camera are obtained.
In the data acquisition process, sampling is carried out in real time according to a certain frequency in a video stream generated by a monocular camera according to a program to generate an image frame, and the generated image frame data is transmitted in real time for use in subsequent steps or modules. Therefore, the method and the device have the advantages that any real binocular data or any synthesized binocular data are not needed, so that the subsequent detection effect does not depend on the precision of the synthesized data, the precision of network training can be ensured only through the data acquired by the monocular camera in real time, and the accuracy of depth estimation is improved; the method does not need real binocular data to adjust the network, overcomes the defect that the traditional method can only carry out training and learning in an off-line mode, and greatly reduces the cost of system data acquisition.
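A minimal sketch of sampling image frames at a fixed frequency from a camera video stream is shown below, using OpenCV; the device index and sampling rate are illustrative assumptions.

```python
import time
import cv2

def sample_frames(device_index=0, target_hz=10.0):
    """Yield frames from a monocular camera at roughly target_hz."""
    cap = cv2.VideoCapture(device_index)
    period = 1.0 / target_hz
    last = 0.0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        now = time.time()
        if now - last >= period:    # keep only frames at the sampling frequency
            last = now
            yield frame             # BGR image handed to the shared encoder
    cap.release()
```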
Secondly, shared feature encoding:
The hybrid-convolution-based encoder mixes depthwise separable convolution, a Squeeze-and-Excitation (SE) module and a residual convolution module, combined with dilated convolution, which improves the network's computational efficiency and inference speed and reduces the parameter scale of the encoder, so that the network can perform real-time inference on resource-limited embedded devices. The SE module implements weighting of the feature channels.
For feature encoding, a deep neural network from the image classification task is used as the encoder to process the image, and the features at each downsampling stage are stored to generate a multi-scale feature set. The deep neural network includes the ResNet deep residual network series. The present disclosure uses a hybrid convolution module as the basic building block of the feature encoder to help the encoder extract robust visual features quickly and efficiently; a schematic diagram of the hybrid convolution module is shown in fig. 2.
On the basis of the residual network module used in the ResNet series, depthwise separable convolution, a Squeeze-and-Excitation (SE) module and dilated convolution are introduced to form the hybrid convolution module, in which a 1×1 dimension-raising convolution expands the feature dimension space, and a 3×3 channel-wise convolution combined with a 1×1 point-wise convolution forms a depthwise separable convolution that improves the computational efficiency of the features and reduces the number of model parameters; the SE module, consisting of two fully connected layers, follows the attention idea and enhances feature expressiveness by re-evaluating the importance of each feature channel. In addition, introducing dilated convolution into the channel-wise convolution guarantees a sufficiently large receptive field during feature extraction and preserves the features' sensitivity to detail information. Specifically:
1) The depthwise separable convolution comes from the MobileNet deep neural network; this network module was designed for embedded devices such as on-board units and mobile phones and is commonly used in lightweight networks.
The depthwise separable convolution decomposes a standard convolution into two steps:
first, a channel-wise convolution is performed, in which each convolution kernel is responsible for only one channel, so the number of kernels equals the number of input feature channels, the number of feature channels stays unchanged, and information between channels remains separated at this point;
second, a point-wise convolution is performed, in which the convolution kernel is reduced to 1×1 while spanning all input feature channels, which mixes the channel information of the features and yields enhanced features.
For a conventional convolution, let the number of channels of the input feature map be C_in, the number of output channels be C_out, the convolution kernel size be K_c, and the width and height of the output feature map be W_out and H_out. The computation Cal(conv) and parameter count Parm(conv) are then:
Cal(conv) = K_c × K_c × C_in × C_out × W_out × H_out   (1)
Parm(conv) = K_c × K_c × C_in × C_out   (2)
Correspondingly, the computation Cal(DW) and parameter count Parm(DW) of the depthwise separable convolution are:
Cal(DW) = K_c × K_c × C_in × W_out × H_out + C_in × C_out × W_out × H_out   (3)
Parm(DW) = K_c × K_c × C_in + C_in × C_out   (4)
The parameter count and computation of the depthwise separable convolution are therefore 1/C_out + 1/K_c² times those of the conventional convolution.
It can be seen that adopting the depthwise separable convolution effectively reduces the parameter count and computation of conventional convolution and lowers the system's computational overhead.
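A minimal sketch contrasting a standard convolution with a depthwise separable one, assuming PyTorch; the channel counts and kernel size are illustrative.

```python
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1)

depthwise_separable = nn.Sequential(
    # channel-wise convolution: one kernel per input channel (groups=c_in)
    nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in),
    # point-wise 1x1 convolution: mixes the channel information
    nn.Conv2d(c_in, c_out, kernel_size=1),
)

def count_params(m):
    return sum(p.numel() for p in m.parameters())

# Parameter counts follow equations (2) and (4); their ratio is roughly
# 1/c_out + 1/k**2, matching the reduction discussed above.
print(count_params(standard), count_params(depthwise_separable))
```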
2) The SE module comes from SE-Net (Squeeze-and-Excitation networks) and draws on the attention mechanism. In the present disclosure, the SE module is used to learn the correlations among the channels of the feature map, evaluate and score each channel, and selectively screen the feature channels. The SE module proceeds in two steps: first, the feature map is compressed, applying global average pooling to the C×W×H feature map to obtain a 1×1×C feature map with a global receptive field; second, feature excitation is performed, in which two fully connected operations carry out a global information interaction across the channel dimension of the 1×1×C feature map, a Sigmoid activation function computes a score for each channel, and the scores are finally multiplied with the original features to obtain the channel-weighted feature map.
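A minimal Squeeze-and-Excitation block sketch in PyTorch; the reduction ratio of 16 is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: C x W x H -> C x 1 x 1
        self.fc = nn.Sequential(                     # excitation: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel score in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # re-weight the information channels
```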
3) To address the loss of detail caused by downsampling, the proposed encoder introduces dilated convolution, reducing the loss of information in the spatial dimension while maintaining a large receptive field.
The specific operation is as follows: dilated convolution is introduced into the fifth layer (layer5) and the sixth layer (layer6) of the network, and a low dilation rate (2/4) is selected so that the backbone network can extract features with a large receptive field and high resolution; to suppress the gridding effect, the max-pooling layer is removed, two layers of ordinary convolution (the seventh layer, layer7, and the eighth layer, layer8) are added after the sixth layer (layer6) with dilation rates of 2 and 1, respectively, and their skip connections are removed to obtain a smoother network output. This improves the expressive power of the features.
In summary, the network structure of the hybrid convolutional encoder proposed in this disclosure is shown in Table 1:
TABLE 1. Encoder network
(The table is provided as an image in the original publication and its contents are not reproduced here.)
Specifically, conv2d denotes a conventional two-dimensional convolution, and mix_conv denotes the hybrid convolution proposed by this method. For the multi-scale feature output, the features output by the second layer (layer2), the third layer (layer3), the fourth layer (layer4), the fifth layer (layer5) and the eighth layer (layer8) are selected to form a multi-scale feature map set representing features at different scales, which assists the decoding step in feature decoding.
Therefore, by adopting the hybrid convolution module, feature encoding can be performed quickly, efficiently and precisely to obtain the multi-scale shared feature group, providing the basis for integrating the pose estimation unit into the depth estimation unit.
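A minimal sketch of a hybrid convolution block in the spirit of fig. 2, combining a 1×1 dimension-raising convolution, an optionally dilated channel-wise convolution, a squeeze-and-excitation stage (here implemented with 1×1 convolutions, equivalent to the two fully connected layers), a 1×1 point-wise projection and a residual connection; the expansion factor, dilation rate and reduction ratio are illustrative assumptions, not the exact configuration of Table 1.

```python
import torch.nn as nn

class HybridConvBlock(nn.Module):
    """Inverted-residual-style block: 1x1 expand, dilated depthwise 3x3,
    squeeze-and-excitation, 1x1 project, residual connection."""
    def __init__(self, channels, expansion=4, dilation=1, se_reduction=16):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=dilation, dilation=dilation, groups=hidden),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.se = nn.Sequential(                     # squeeze-and-excitation on hidden channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden, hidden // se_reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden // se_reduction, hidden, 1), nn.Sigmoid())
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, x):
        h = self.depthwise(self.expand(x))
        h = h * self.se(h)                           # channel re-weighting
        return x + self.project(h)                   # residual connection
```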
Third, decoding
Decoding is divided into two parts, corresponding respectively to the depth estimation unit, which outputs the depth map, and the pose estimation unit, which outputs the pose transformation.
1) The depth estimation unit proposed in the present disclosure works with the multi-scale feature map set: in the decoding process, as the feature resolution is gradually increased, the decoder features are fused with the encoder features at the corresponding scale to recover the detail information at that scale. The present disclosure adopts a fusion module based on a spatial attention mechanism to perform the feature fusion operation; its network structure is shown in fig. 3.
Let the encoder-side feature be f_E and the decoder-side feature be f_D. Each is passed through a 1×1 convolution for dimensionality reduction, giving the compressed features f_h_E and f_h_D, respectively; the feature f_h_D and the feature f_h_E are then concatenated, and after activation the features are compressed to one channel by a 3×3 convolution followed by a Sigmoid function that outputs a weight map σ, which represents the screening of encoder-side information conditioned on the decoder-side information. Finally, the weight map σ is multiplied point-wise with the original encoder feature f_E, and the result is concatenated with the decoder feature f_D for the subsequent depth decoding work.
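A minimal sketch of this spatial-attention fusion in PyTorch; the hidden channel count is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    def __init__(self, enc_channels, dec_channels, hidden=64):
        super().__init__()
        self.reduce_e = nn.Conv2d(enc_channels, hidden, 1)   # 1x1 reduction of f_E -> f_h_E
        self.reduce_d = nn.Conv2d(dec_channels, hidden, 1)   # 1x1 reduction of f_D -> f_h_D
        self.attn = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * hidden, 1, 3, padding=1),          # compress concatenation to 1 channel
            nn.Sigmoid(),                                    # weight map sigma
        )

    def forward(self, f_e, f_d):
        sigma = self.attn(torch.cat([self.reduce_d(f_d), self.reduce_e(f_e)], dim=1))
        # screened encoder feature concatenated with the decoder feature
        return torch.cat([f_e * sigma, f_d], dim=1)
```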
2) The pose estimation unit proposed in the present disclosure performs feature matching by correlation calculation. The specific flow is as follows: the correlation calculation receives the feature maps f_1 and f_2 of the two frames and, in a convolution-like operation, computes the correlation over feature blocks of size (2k+1)×(2k+1) centered on any pair of positions x_1 in f_1 and x_2 in f_2. To reduce computation, the similarity of a given feature block in f_1 is not computed against all feature blocks in f_2, but only against the feature blocks of f_2 within a displacement range of d in the up, down, left and right directions around the corresponding position. The calculation formula is:
c(x_1, x_2) = Σ_{o ∈ [-k,k]×[-k,k]} ⟨ f_1(x_1 + o), f_2(x_2 + o) ⟩   (5)
where f_1(·) and f_2(·) denote the input feature maps, x_1 and x_2 denote the computation centers, k denotes the computation range, ⟨·⟩ denotes the dot-product operation, o denotes the displacement step within the local region, and c(x_1, x_2) denotes the final value of the dot-product operation over the feature-map patches centered on x_1 and x_2. Note that the calculation direction of equation (5) is unidirectional and does not satisfy the commutative law, i.e., c(x_1, x_2) ≠ c(x_2, x_1), which guarantees the unidirectionality of the pose transformation in the pose estimation process.
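A minimal sketch of the local correlation in PyTorch, shown for per-position dot products (the k = 0 case of equation (5)), which is a common simplification; max_disp plays the role of the displacement range d.

```python
import torch
import torch.nn.functional as F

def local_correlation(f1, f2, max_disp=4):
    """Correlate f1 with f2 over displacements up to max_disp.
    f1, f2: tensors of shape (B, C, H, W). Returns (B, (2*max_disp+1)**2, H, W)."""
    b, c, h, w = f1.shape
    f2_pad = F.pad(f2, [max_disp] * 4)                 # pad left/right/top/bottom
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            f2_shift = f2_pad[:, :, dy:dy + h, dx:dx + w]
            out.append((f1 * f2_shift).sum(dim=1, keepdim=True))  # dot product per position
    return torch.cat(out, dim=1)                       # correlation volume fed to the pose decoder
```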
In addition, for the correlation result, the present disclosure introduces dense convolution as the decoder to decode the correlation between features; the pose transformation matrix between the two frames is finally output through one layer of dense convolution and three convolution layers.
In summary, the network diagram for the encoding and decoding of the pose estimation is shown in fig. 4.
Here f1 and f2 denote the output features of the two adjacent frames after encoding by the encoding module, f3 denotes the correlation map, and the pose transformation matrix R, T between the two frames is finally output through pose decoding.
For pose estimation decoding, not only must the encoded features of the current frame be received; correlation must also be computed with the encoded features of the previous frame.
Fourth, loss calculation
The loss calculation module adopted by the present disclosure is divided into two parts: the first is the reconstruction of the target frame, in which the correspondence between coordinates in the source frame and the target frame is calculated using the decoded depth map and pose transformation matrix, and the target frame is reconstructed; the second is the computation of the loss function, for which the present disclosure employs the L1 loss, the structural similarity loss and the edge smoothing loss.
1) The source frame reconstructs the target frame using the depth map output by the depth estimation unit and the pose transformation between the source frame and the target frame output by the pose estimation unit: following the image reconstruction algorithm, points in two-dimensional space are projected into three-dimensional space by a back-projection operation, the three-dimensional points are projected into the coordinate space of the adjacent frame using a coordinate-system transformation and a projection operation, and the target frame is reconstructed by the image sampling module; finally, supervision information is extracted from the pixel relation between the reconstructed target frame and the original target frame to complete the training supervision of the network. The flow is shown in fig. 5.
The back projection part in fig. 5 refers to a depth map D of the target frame output by the networktAnd projecting the pixel points of the target frame into a camera coordinate system under the target frame to generate a sparse point cloud PC (personal computer)tThe calculation formula is as follows:
PCt(pt)=Dt(pt)K-1pt (6)
wherein p istRepresenting the coordinates of any pixel point in the target frame.
The coordinate system conversion step utilizes a pose transformation matrix T of the source frame and the target frame output by the pose estimation unitt→sPC (personal computer) for sparse point cloudtTransforming to the source frame coordinate system to obtain point cloud PCsThe calculation formula is as follows:
PCs=Tt→sPCt=Rt→sPCt+tt→s (7)
wherein R ist→sAnd tt→sThe rotation matrix and the translation vector are both output by the attitude estimation unit.
The projection module receives a sparse point cloud PC under a source frame coordinate systemsAnd then, utilizing the internal parameter K to re-project the point cloud to a pixel coordinate system of the source frame. Obtaining corresponding point coordinates p'sThe calculation formula is as follows:
p′s=KPCs(pt)(8)
in summary, there are two pixel mapping relations between frames:
p′s=KTt→sDt(pt)K-1pt (9)
wherein the camera internal reference K is a pre-calibrated value and a pose transformation matrix Tt→sAnd depth map DtThe obtained result is output by the attitude estimation unit and the depth estimation unit respectively.
Because the projected coordinates p'_s are continuous values, the sampling step uses a fully differentiable bilinear interpolation: the value at each sub-pixel coordinate is interpolated from the pixel values of its four neighbouring coordinate points, which establishes the correspondence between source-frame and target-frame pixels and generates the reconstructed target frame I'_t.
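For illustration only, the following Python/PyTorch sketch chains equations (6) to (9) with differentiable bilinear sampling (here via torch.nn.functional.grid_sample) to reconstruct the target frame from a source frame; the tensor layouts, helper names and the use of grid_sample are assumptions rather than the disclosure's own implementation.

    import torch
    import torch.nn.functional as F

    def reconstruct_target(source_img, depth_t, K, K_inv, T_t_to_s):
        # source_img: (B, 3, H, W) source frame I_s
        # depth_t:    (B, 1, H, W) predicted target-frame depth map D_t
        # K, K_inv:   (B, 3, 3) camera intrinsics and their inverse
        # T_t_to_s:   (B, 4, 4) pose transformation from target to source frame
        b, _, h, w = depth_t.shape
        device = depth_t.device

        # Homogeneous pixel grid p_t, shape (B, 3, H*W)
        ys, xs = torch.meshgrid(torch.arange(h, device=device),
                                torch.arange(w, device=device), indexing="ij")
        ones = torch.ones_like(xs)
        pix = torch.stack([xs, ys, ones], dim=0).float().reshape(1, 3, -1).expand(b, -1, -1)

        # Equation (6): back-project to the target camera coordinate system
        cam_points = depth_t.reshape(b, 1, -1) * (K_inv @ pix)

        # Equation (7): transform the point cloud into the source camera frame
        cam_points_h = torch.cat([cam_points, torch.ones(b, 1, h * w, device=device)], dim=1)
        cam_points_s = (T_t_to_s @ cam_points_h)[:, :3, :]

        # Equations (8)/(9): project with the intrinsics and normalise by depth
        proj = K @ cam_points_s
        px = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)

        # Bilinear sampling: normalise coordinates to [-1, 1] and sample the source frame
        px = px.reshape(b, 2, h, w).permute(0, 2, 3, 1)
        grid = torch.stack([2 * px[..., 0] / (w - 1) - 1,
                            2 * px[..., 1] / (h - 1) - 1], dim=-1)
        return F.grid_sample(source_img, grid, mode="bilinear",
                             padding_mode="border", align_corners=True)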
2) The loss function employed by the present disclosure includes three parts: l1 loss, Structural SIMilarity (SSIM) loss, and edge smoothing loss. The L1 loss refers to L1 distance metric loss commonly used in the field of machine learning, and takes the absolute value of the difference at pixel level between the original target frame and the reconstructed target frame as a loss function, as shown in formula (10):
L_1 = Σ_{s∈S} Σ_p | I_t(p) - I'_t(p) |        (10)
where S represents the sequence of consecutive pictures used in one training step (its size |S| is generally 3 or 5), and I_t(p) and I'_t(p) represent the pixel values at pixel p of the original target frame and the reconstructed target frame, respectively.
<I_1, …, I_N> denotes the sequence of images numbered in temporal order within one training step, corresponding to the 1st through the N-th frame.
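As a minimal sketch (an assumption about the implementation, not taken from the disclosure), the L1 term of equation (10) for one source/target pair could be written in Python/PyTorch as:

    import torch

    def l1_photometric_loss(target: torch.Tensor, reconstructed: torch.Tensor) -> torch.Tensor:
        # Equation (10) for one source/target pair: per-pixel absolute difference,
        # averaged here over all pixels instead of summed.
        return (target - reconstructed).abs().mean()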
The structural similarity (SSIM) metric was originally used to describe the similarity between an uncompressed original picture and its compressed, distorted version, i.e. to measure the performance of a compression algorithm. In the present disclosure, the SSIM loss function constrains the quality of the reconstructed picture in three aspects: luminance, contrast and structure. The calculation formulas are as follows:
l(x, y) = (2·u_x·u_y + c_1) / (u_x^2 + u_y^2 + c_1)        (11)
c(x, y) = (2·σ_x·σ_y + c_2) / (σ_x^2 + σ_y^2 + c_2)        (12)
s(x, y) = (σ_xy + c_3) / (σ_x·σ_y + c_3)        (13)
SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ        (14)
where l(x, y) denotes luminance similarity, c(x, y) contrast similarity and s(x, y) structural similarity; x and y are the two image windows being compared, and the exponents are generally set to α = β = γ = 1. u_x and u_y are the means of x and y, σ_x^2 and σ_y^2 are their variances, and σ_xy is the covariance of x and y. The constants are c_1 = (k_1·L)^2 and c_2 = (k_2·L)^2, where L is the dynamic range of the pixel values, with k_1 = 0.01, k_2 = 0.03 and c_3 = 0.5·c_2. The similarity loss L_ssim(x, y) between the two frames is then given by equation (15):
L_ssim(x, y) = (1 - SSIM(x, y)) / 2        (15)
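A minimal sketch consistent with equations (11) to (15), assuming PyTorch, pixel values normalised to [0, 1] (so L = 1 and c1, c2 reduce to k1^2, k2^2), 3x3 local windows realised with average pooling, and the usual simplification c3 = 0.5·c2 that folds the contrast and structure terms together; these choices are assumptions, not prescribed by the disclosure.

    import torch
    import torch.nn.functional as F

    def ssim_loss(x: torch.Tensor, y: torch.Tensor,
                  c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
        # Local statistics over 3x3 windows (stride 1, padding 1 keeps the spatial size)
        mu_x = F.avg_pool2d(x, 3, 1, padding=1)
        mu_y = F.avg_pool2d(y, 3, 1, padding=1)
        sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
        sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
        sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y

        # SSIM with alpha = beta = gamma = 1 and c3 = c2 / 2 folded in
        ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))

        # Equation (15): (1 - SSIM) / 2, clamped and averaged over all pixels
        return ((1 - ssim) / 2).clamp(0, 1).mean()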
Edge smoothing loss: the depth map output by a network usually shows discontinuity artifacts, i.e. the predicted depth over the surface of an object that actually has uniform depth is not smooth. To make the depth values over object surfaces smoother while keeping the depth values between different objects clearly separated, an edge smoothing loss L_smooth is introduced, as shown in equation (16):
L_smooth = | ∂_x d*_t | · e^{-| ∂_x I_t |} + | ∂_y d*_t | · e^{-| ∂_y I_t |}        (16)
where d_t is the depth value of the frame at time t, mean(d_t) is its mean value, d*_t = d_t / mean(d_t) is the mean-normalized inverse depth, I_t is the pixel value, and ∂_x and ∂_y denote the differential operations in the x and y dimensions, respectively.
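A minimal sketch of equation (16), assuming PyTorch, a disparity (inverse depth) map that is mean-normalised before the gradients are taken, and finite differences standing in for the differential operators ∂_x and ∂_y; the tensor names are illustrative.

    import torch

    def edge_aware_smoothness(disp: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
        # disp: (B, 1, H, W) predicted inverse depth / disparity
        # img:  (B, 3, H, W) corresponding input image I_t
        disp = disp / (disp.mean(dim=[2, 3], keepdim=True) + 1e-7)   # mean normalisation d*_t

        grad_disp_x = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
        grad_disp_y = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()

        grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
        grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)

        # Down-weight depth gradients where the image itself has strong edges
        return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
               (grad_disp_y * torch.exp(-grad_img_y)).mean()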
Fifth, storage
In the encoder-shared depth estimation method of the present disclosure, the pose estimation decoder not only receives the encoded features of the current frame but also performs a correlation calculation with the encoded features of the previous frame. Therefore, after training on the current frame is completed, the encoded features and the original picture of the current frame need to be stored in a storage medium for use when training on the next frame. Storage media include, but are not limited to, system memory and GPU video memory. To improve efficiency, the present disclosure proposes that the space reserved in the storage medium for storing the features and pictures be kept fixed.
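One possible realisation of such a fixed-size store (an assumption for illustration, not the disclosure's specification) is to allocate the buffers for the previous frame's picture and encoded features once and overwrite them in place on every training step:

    from dataclasses import dataclass
    from typing import List, Optional
    import torch

    @dataclass
    class FrameCache:
        # Fixed-size buffer holding the previous frame's image and encoded features.
        # The tensors are allocated once and overwritten in place afterwards, so the
        # memory footprint in RAM/VRAM stays constant across training steps.
        image: Optional[torch.Tensor] = None
        features: Optional[List[torch.Tensor]] = None   # multi-scale shared feature group

        def update(self, image: torch.Tensor, features: List[torch.Tensor]) -> None:
            if self.image is None:
                # first frame: allocate the fixed storage once
                self.image = image.detach().clone()
                self.features = [f.detach().clone() for f in features]
            else:
                # later frames: reuse the same buffers
                self.image.copy_(image.detach())
                for dst, src in zip(self.features, features):
                    dst.copy_(src.detach())

During training, the cached features would be handed to the pose estimation decoder together with the current frame's features before update() overwrites them with the current frame.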
In conclusion, the pose estimation network is redesigned from the viewpoint of feature matching: a correlation calculation module and a dense convolution module are introduced, improving the performance of the pose estimation network, and the feature-matching-based pose decoder achieves high-precision pose transformation output between two frames. Because a feature-sharing design is adopted, the pose estimation unit and the depth estimation unit share the same feature encoder, yielding a single network structure; this reduces network complexity, saves a large amount of video memory during training and speeds up network inference, so that the occupation of video memory and computing resources in both training and forward inference is reduced, the overall speed and performance of the network are improved, and online learning becomes feasible on low-power devices. Because the hybrid-convolution encoder combines depthwise separable convolution, the SE module and residual convolution modules with dilated convolution, it improves the computational efficiency of the network while increasing inference speed and reducing the parameter scale of the encoder, so that feature encoding can be performed quickly, efficiently and accurately, and the network can perform real-time inference on resource-constrained embedded devices.
Therefore, the method of the present disclosure reduces network computation and video memory occupation, improves the output accuracy of the pose estimation network, lowers the network's demand for video memory and computing resources, and improves computational efficiency; the monocular depth estimation task can be completed accurately and in real time on an unmanned vehicle, the streaming video captured by the monocular camera is processed in real time, and the depth of objects in front of the unmanned vehicle is determined quickly and accurately.
Therefore, the novel feature-sharing-based self-supervised monocular depth estimation method and system provided by the present disclosure are well suited to the outdoor operating environment of unmanned vehicles: the computational cost is low, a low-power vehicle-mounted processor is sufficient, and no expensive sensor system is required; the depth estimation is real-time, highly accurate and has broad application prospects.
So far, the technical solutions of the present disclosure have been described in connection with the preferred embodiments shown in the drawings, but it should be understood by those skilled in the art that the above embodiments are only for clearly illustrating the present disclosure, and not for limiting the scope of the present disclosure, and it is apparent that the scope of the present disclosure is not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the disclosure, and the technical scheme after the changes or substitutions will fall into the protection scope of the disclosure.

Claims (10)

1. A self-supervision type monocular depth estimation method based on feature sharing is characterized in that,
a single network structure is adopted and the pose estimation unit is integrated into the depth estimation unit, so that the depth estimation and pose estimation operations are unified and a monocular single-source depth estimation network based on feature sharing is obtained; the monocular single-source depth estimation network comprises: a shared feature encoding unit, a depth estimation unit, a pose estimation unit and a supervised training unit;
the method comprises the following steps:
step one, acquiring data, namely acquiring data from a video stream through a monocular camera and outputting image frames;
step two, sharing feature coding, namely, after receiving the image frame, preprocessing the image frame, and outputting a multi-scale sharing feature group through an encoder;
step three, decoding, namely receiving the multi-scale shared feature group, processing the multi-scale features through the depth estimation unit, and outputting a depth map at the original resolution; and performing feature matching and decoding on the features of the current frame and the features of the previous frame through the pose estimation unit, and outputting the pose transformation between the two frames;
and step four, loss calculation, namely combining the depth map output by the depth estimation unit with the pose transformation between the two frames output by the pose estimation unit to reconstruct the target frame, and supervising the training of the network through the difference between the original target frame and the reconstructed target frame.
2. The feature sharing based self-supervised monocular depth estimation method of claim 1,
the monocular camera is deployed on an unmanned vehicle.
3. The feature sharing based self-supervised monocular depth estimation method of claim 1 or 2,
the monocular camera is deployed at the upper edge of a front windshield of the unmanned vehicle.
4. The feature sharing based self-supervised monocular depth estimation method of any one of claims 1-3,
and reconstructing a target frame through projection and interpolation operations.
5. The feature sharing based self-supervised monocular depth estimation method of any one of claims 1-3, further comprising the steps of:
and step five, storing, namely storing the original image of the frame and the features output in the feature encoding step in a storage medium for the operation of the decoding step and the loss calculating step at the next moment.
6. The feature sharing based self-supervised monocular depth estimation method of claim 1 or 2,
the monocular camera resolution is 720P or more;
the monocular camera is a monocular camera in a monocular acquisition system, or any one of monocular cameras in a multi-view acquisition system comprising a plurality of monocular cameras.
7. The feature sharing based self-supervised monocular depth estimation method of claim 1 or 2,
in the first step, in a video stream generated by the monocular camera, sampling is carried out in real time according to a certain frequency to generate an image frame.
8. A readable storage medium, wherein
the readable storage medium has stored therein execution instructions for implementing the feature sharing based self-supervised monocular depth estimation method of any one of the preceding claims when executed by a processor.
9. A self-supervised monocular depth estimation system based on feature sharing, comprising:
a memory storing a program for performing the feature sharing based self-supervised monocular depth estimation method of any one of claims 1-8; and
a processor, wherein the processor executes the program.
10. An unmanned vehicle, comprising:
an on-board processor performing the feature sharing based self-supervised monocular depth estimation method of any one of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110196301.6A CN113034563A (en) 2021-02-22 2021-02-22 Self-supervision type monocular depth estimation method based on feature sharing

Publications (1)

Publication Number Publication Date
CN113034563A true CN113034563A (en) 2021-06-25

Family

ID=76460975

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination