CN114663496A - Monocular vision odometer method based on Kalman pose estimation network - Google Patents

Monocular vision odometer method based on Kalman pose estimation network

Info

Publication number
CN114663496A
Authority
CN
China
Prior art keywords
pose
estimation network
network
loss function
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210290482.3A
Other languages
Chinese (zh)
Other versions
CN114663496B (en)
Inventor
曾慧
修海鑫
刘红敏
樊彬
张利欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Shunde Graduate School of USTB
Original Assignee
University of Science and Technology Beijing USTB
Shunde Graduate School of USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB, Shunde Graduate School of USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN202210290482.3A priority Critical patent/CN114663496B/en
Publication of CN114663496A publication Critical patent/CN114663496A/en
Application granted granted Critical
Publication of CN114663496B publication Critical patent/CN114663496B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/277 Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a monocular visual odometry method based on a Kalman pose estimation network, belonging to the technical field of computer vision. The method comprises the following steps: constructing a depth estimation network and a pose estimation network based on Kalman filtering; calculating a motion-weighted photometric error loss function of the video image sequence from the pose transformation between each pair of adjacent frame images output by the pose estimation network and the depth image of each input frame output by the depth estimation network; introducing a variational autoencoder structure into the constructed pose estimation network and depth estimation network, and calculating a variational autoencoder loss function; based on the obtained photometric error loss function and variational autoencoder loss function, training the pose estimation network and the depth estimation network with a training strategy for the missing-frame case; and estimating the camera pose corresponding to each frame of image using the trained pose estimation network. The method improves the accuracy of camera pose estimation and adapts to missing frames.

Description

Monocular vision odometer method based on Kalman pose estimation network
Technical Field
The invention relates to the technical field of computer vision, and in particular to a monocular visual odometry method based on a Kalman pose estimation network.
Background
Visual odometry, as a component of simultaneous localization and mapping (SLAM), is widely used in robot navigation, autonomous driving, augmented reality, wearable computing, and other fields. Visual odometry estimates the current position and orientation of a camera from the input video image frames. According to the type and number of sensors, visual odometry can be classified into monocular visual odometry, binocular (stereo) visual odometry, visual-inertial odometry, and so on. Monocular visual odometry has the advantages of requiring only one camera, placing low demands on hardware, and requiring no rectification.
Traditional visual odometry methods first extract and match image features and then estimate the relative pose between two adjacent frames from their geometric relationship. Such methods achieve good results in practice and remain the mainstream approach to visual odometry, but they struggle to balance computational performance and robustness.
Monocular visual odometry based on deep learning can be divided into supervised and self-supervised methods. Self-supervised methods require only video image frames as input, need no ground-truth poses, do not depend on additional equipment, and are therefore more widely applicable than supervised methods.
Many existing self-supervised methods do not consider the association between frames and make insufficient use of inter-frame information, so the trained network struggles to estimate accurate poses and cannot adapt to missing frames. In addition, moving objects in a scene do not follow the same Euclidean transformation as the static scene and violate the static-scene assumption, so the scene motion is difficult to describe with a single Euclidean transformation and the network's estimates are biased.
Disclosure of Invention
The embodiments of the invention provide a monocular visual odometry method based on a Kalman pose estimation network, which improves the accuracy of camera pose estimation and adapts to missing frames. The technical solution is as follows:
An embodiment of the invention provides a monocular visual odometry method based on a Kalman pose estimation network, comprising the following steps:
constructing a depth estimation network and a Kalman-filtering-based pose estimation network, wherein the pose estimation network outputs the pose transformation between each pair of adjacent input frame images and the depth estimation network outputs the depth image of each input frame;
calculating a motion-weighted photometric error loss function of the video image sequence according to the output pose transformations between adjacent frame images and the depth images of the input frames;
introducing a variational autoencoder structure into the constructed pose estimation network and depth estimation network, and calculating a variational autoencoder loss function;
based on the obtained photometric error loss function and variational autoencoder loss function, training the pose estimation network and the depth estimation network with a training strategy for the missing-frame case;
and estimating the camera pose corresponding to each frame of image in the video image sequence whose poses are to be estimated, using the trained pose estimation network.
Further, the pose estimation network includes a pose measurement network, a pose weighted fusion network, a pose update network, and a pose prediction network; wherein,
the pose measurement network encodes the input adjacent frame images I_{t-1} and I_t to obtain the pose measurement vector C_{measure,t} at time t:
C_{measure,t} = Measure(I_{t-1}, I_t)
where I_{t-1} and I_t are the images at time t-1 and time t, respectively, and Measure(·) is the pose measurement network;
the pose measurement vector C_{measure,t} and the pose prediction vector C_{pred,t} are input into the pose weighted fusion network to obtain the pose weighted fusion vector C_{fuse,t} at time t:
C_{fuse,t} = (1 - W_t) * C_{measure,t} + W_t * C_{pred,t}
where W_t is a weight in [0, 1] output by the last fully connected layer of the pose weighted fusion network; C_{pred,t} is the pose prediction vector for time t output by the pose prediction network when the adjacent frame images I_{t-2} and I_{t-1} were input to the pose estimation network, C_{pred,t} = Predict(C_{fuse,t-1}), where C_{fuse,t-1} is the pose weighted fusion vector at time t-1 and Predict(·) is the pose prediction network;
the pose weighted fusion vector C_{fuse,t} is input into the pose update network to estimate the pose transformation T_{t→t-1}:
T_{t→t-1} = Update(C_{fuse,t})
where Update(·) is the pose update network, and T_{t→t-1} is the 6-degree-of-freedom relative pose vector from I_{t-1} to I_t, comprising the relative rotation and the relative displacement.
Furthermore, both the pose estimation network and the depth estimation network adopt encoder-decoder structures.
Further, calculating the motion-weighted photometric error loss function of the video image sequence according to the output pose transformation between each pair of adjacent frame images and the depth image of the input frame comprises:
chaining (cumulatively multiplying) the pose transformations between each pair of adjacent frame images output by the pose estimation network to obtain pose transformations over longer time spans, and computing motion-weighted photometric errors between images based on these longer-span pose transformations;
and calculating the motion-weighted photometric error loss function of the video image sequence from the computed photometric errors.
Further, chaining the pose transformations between each pair of adjacent frame images output by the pose estimation network to obtain pose transformations over longer time spans, and computing the motion-weighted photometric errors between images based on these longer-span pose transformations, comprises:
for a video image sequence of length N whose frames correspond to times t_0, t_1, ..., t_{N-1}, cumulatively multiplying the poses between each pair of adjacent frame images output by the pose estimation network to obtain the pose transformation over a longer time span:
T_{t_j→t_i} = T_{t_{i+1}→t_i} · T_{t_{i+2}→t_{i+1}} · ... · T_{t_j→t_{j-1}}
where T_{t_j→t_i} is the pose transformation from time t_j to time t_i between the images, and N is the length of each batch of video image sequences input to the pose estimation network and the depth estimation network;
for a point p_{t_j} of the image I_{t_j}, its three-dimensional coordinates are recovered from its depth image D_{t_j}; its corresponding projected point p_{t_i} on the image I_{t_i} is expressed as:
p_{t_i} ≈ K · T_{t_j→t_i} · D_{t_j}(p_{t_j}) · K^{-1} · p_{t_j}
where K is the camera intrinsic matrix and D_{t_j} is the depth image at time t_j;
by sampling the image I_{t_i} at the projected coordinates, the reconstructed image Î_{t_j} of the time-t_j image I_{t_j} is obtained;
for a pixel p of Î_{t_j}, its motion weighting term W_mw is computed (the explicit formula is provided as an image in the original publication);
using the obtained motion weighting term W_mw, the motion-weighted photometric error between the images I_{t_j} and Î_{t_j} is computed:
E^{mw}_{t_j→t_i} = W_mw ⊙ [ α_0 · (1 − SSIM(I_{t_j}, Î_{t_j})) + α_1 · ‖I_{t_j} − Î_{t_j}‖_1 + α_2 · ‖I_{t_j} − Î_{t_j}‖_2 ]
where E^{mw}_{t_j→t_i} is the motion-weighted photometric error between the images I_{t_j} and Î_{t_j}, SSIM(I_{t_j}, Î_{t_j}) is the structural similarity between the original image I_{t_j} and the reconstructed image Î_{t_j}, α_0, α_1 and α_2 are hyper-parameters controlling the proportion of each term, the symbol ⊙ denotes the pixel-wise product, ‖·‖_1 is the 1-norm, and ‖·‖_2 is the 2-norm.
Further, before using the obtained motion weighting term W_mw to compute the motion-weighted photometric error between the images I_{t_j} and Î_{t_j}, the method further comprises:
determining the pixels involved in the photometric error calculation and labeling them as mask:
mask = [ ‖Î_{t_j} − I_{t_j}‖_* < ‖I_{t_i} − I_{t_j}‖_* ]
where I_{t_i} is the original image at time t_i, I_{t_j} is the original image at time t_j, Î_{t_j} is the reconstructed image of the time-t_j image obtained by sampling the original image I_{t_i} at time t_i, ‖·‖_* denotes a photometric error, i.e., a 1-norm or a 2-norm, and [·] equals 1 where the condition holds and 0 otherwise;
when computing the motion-weighted photometric error between the images I_{t_j} and Î_{t_j}, only the pixels marked by mask are used for the calculation.
Further, the photometric error loss function is expressed as:
L_p = Σ_{0 ≤ i < j ≤ N−1} E^{mw}_{t_j→t_i}
where L_p denotes the photometric error loss function and E^{mw}_{t_j→t_i} denotes the motion-weighted photometric error between the images I_{t_j} and Î_{t_j}.
Further, the variational autoencoder loss function is expressed as:
L_VAE = λ_1 · KL( q_d(c_d|x_d) ‖ p_η(c) ) + λ_2 · KL( q_p(c_p|x_p) ‖ p_η(c) ) − E_{c_d∼q_d(c_d|x_d), c_p∼q_p(c_p|x_p)}[ log p(x̂ | c_d, c_p) ]
where L_VAE denotes the variational autoencoder loss function; x_d and x_p denote input images; λ_1 and λ_2 denote hyper-parameters; p_η(c) is the prior distribution and c is its argument; q_d(c_d|x_d) is the sampling distribution of the depth estimation network's code c_d; q_p(c_p|x_p) is the sampling distribution of the pose estimation network's code c_p; KL(·‖·) is the KL divergence, KL(q_d(c_d|x_d)‖p_η(c)) being the KL divergence of q_d(c_d|x_d) from p_η(c) and KL(q_p(c_p|x_p)‖p_η(c)) being the KL divergence of q_p(c_p|x_p) from p_η(c); p(x̂|c_d, c_p) is the probability distribution over the reconstructed image x̂ generated from the outputs obtained by feeding c_d and c_p into the decoders of the depth estimation network and the pose estimation network, respectively; E denotes the mathematical expectation; c_d ∼ q_d(c_d|x_d) means that c_d obeys the distribution q_d(c_d|x_d), c_p ∼ q_p(c_p|x_p) means that c_p obeys the distribution q_p(c_p|x_p), and E_{c_d∼q_d(c_d|x_d), c_p∼q_p(c_p|x_p)}[log p(x̂|c_d, c_p)] is the expectation of log p(x̂|c_d, c_p) under the conditions c_d ∼ q_d(c_d|x_d) and c_p ∼ q_p(c_p|x_p).
Further, training the pose estimation network and the depth estimation network with the training strategy for the missing-frame case, based on the obtained photometric error loss function and variational autoencoder loss function, comprises:
for the output of the depth estimation network, computing a depth smoothing loss function:
L_s = Σ_p ( |∂_x d_t(p)| · e^{−|∂_x I_t(p)|} + |∂_y d_t(p)| · e^{−|∂_y I_t(p)|} )
where d_t is the disparity, which is inversely proportional to the depth image D_t, ∂_x and ∂_y denote the partial derivatives in the x- and y-directions, and I_t is the image at time t;
determining a final loss function L based on the obtained depth smoothing loss function, photometric error loss function, and variational autoencoder loss function:
L = L_p + λ·L_s + L_VAE
where λ is a hyper-parameter controlling the weight of the depth smoothing loss function, L_p is the photometric error loss function, and L_VAE is the variational autoencoder loss function;
and training the pose estimation network and the depth estimation network with the training strategy for the missing-frame case, using the obtained final loss function.
Further, training the pose estimation network and the depth estimation network with the training strategy for the missing-frame case comprises:
inputting all images of a batch of video image sequences into the pose estimation network and the depth estimation network, and training the pose estimation network and the depth estimation network;
inputting all images of a batch of video image sequences into the depth estimation network while setting one or more frames of the batch to zero before inputting them into the pose estimation network, and training the pose estimation network and the depth estimation network.
The monocular visual odometry method based on the Kalman pose estimation network provided by the embodiments of the invention has at least the following advantages:
(1) Many existing self-supervised methods do not consider the association between frames and make insufficient use of inter-frame information, so the trained network struggles to estimate accurate poses and cannot adapt to missing frames. To address this, this embodiment constructs a pose estimation network based on Kalman filtering and designs a training strategy for the missing-frame case on top of it, so that the pose estimation network can exploit inter-frame information when estimating the current pose and is better suited to the missing-frame case;
(2) A moving object that may exist in the scene does not follow the same Euclidean transformation as the scene, violating the static-scene assumption, so the scene motion is difficult to describe with a single Euclidean transformation and the estimation result of the pose estimation network is biased. To address this, this embodiment designs a motion-weighted photometric error that reduces the influence of such moving objects during training, improving the accuracy of pose estimation.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below illustrate only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a monocular vision odometry method based on a Kalman pose estimation network according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a pose estimation network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a work flow of a monocular vision odometry method based on a Kalman pose estimation network according to an embodiment of the present invention;
fig. 4 is a schematic diagram of the trajectories estimated by the method provided by the embodiment of the present invention on sequences 09 and 10 in the KITTI odometry dataset.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
As shown in FIG. 1, an embodiment of the present invention provides a monocular visual odometry method based on a Kalman pose estimation network, including:
S101, constructing a depth estimation network (DepthNet) and a Kalman-filtering-based pose estimation network (KF-PoseNet), wherein the pose estimation network outputs the pose transformation between each pair of adjacent input frame images and the depth estimation network outputs the depth image of each input frame;
as shown in fig. 2, the pose estimation network includes: the system comprises a pose measurement network, a pose weighted fusion network, a pose updating network and a pose prediction network; wherein, as shown in Table 1,
the pose measurement network comprises a ResNet50 layer, three convolutional layers and a global averaging pooling layer; the first two layers of the three convolutional layers take ReLU (Rectification Linear Unit) as an activation function, and the last layer of the convolutional layers is a pure convolutional layer without an activation function; the input of the pose measurement network passes through ResNet50, then sequentially passes through three layers of convolutional layers, and finally is output through a full-play average pooling layer; the pose measurement network uses the ResNet50 structure as an encoder;
the pose weighted fusion network comprises 4 full connection layers and a weighted fusion layer; the first three layers of the 4 full connection layers use ReLU as an activation function, and the last layer of the 4 full connection layers use a Sigmoid function as an activation function; cmeasure,tAnd Cpred,tAfter the first full connection layer is input, the first full connection layer sequentially passes through the last three full connection layers, and a weight coefficient with a value range of 0-1 is output; the weight coefficient is further related to Cmeasure,tAnd Cpred,tSending the mixture into a weighted fusion layer;
the pose updating network comprises 4 fully-connected layers, and the first three fully-connected layers use ReLU as an activation function; the 4 full-connection layers are connected in sequence;
similar to the pose updating network, the pose prediction network also comprises 4 fully-connected layers, and the 4 fully-connected layers are connected in sequence.
TABLE 1 KF-PoseNet network architecture
(Table 1 is provided as an image in the original publication and is not reproduced here.)
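For orientation only, the following is a minimal PyTorch-style sketch of how the pose weighted fusion network and the pose update network described above could be wired. The code width (code_dim = 1024) and the hidden-layer sizes are illustrative assumptions and are not taken from Table 1.

```python
import torch
import torch.nn as nn

class PoseWeightedFusion(nn.Module):
    """Four fully connected layers: the first three use ReLU, the last uses Sigmoid
    and outputs the fusion weight W_t in [0, 1]."""
    def __init__(self, code_dim=1024):  # code_dim is an assumed value
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * code_dim, code_dim), nn.ReLU(),
            nn.Linear(code_dim, code_dim), nn.ReLU(),
            nn.Linear(code_dim, code_dim), nn.ReLU(),
            nn.Linear(code_dim, 1), nn.Sigmoid(),
        )

    def forward(self, c_measure, c_pred):
        w = self.mlp(torch.cat([c_measure, c_pred], dim=-1))   # W_t
        return (1 - w) * c_measure + w * c_pred                # C_fuse,t

class PoseUpdate(nn.Module):
    """Four fully connected layers (first three with ReLU) mapping the fused code
    to a 6-degree-of-freedom relative pose vector."""
    def __init__(self, code_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 6),
        )

    def forward(self, c_fuse):
        return self.mlp(c_fuse)  # T_{t -> t-1}
```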
In this embodiment, the working process of the pose estimation network is as follows:
the pose measurement network encodes the input adjacent frame images I_{t-1} and I_t to obtain the pose measurement vector C_{measure,t} at time t:
C_{measure,t} = Measure(I_{t-1}, I_t)
where I_{t-1} and I_t are the images at time t-1 and time t, respectively, and Measure(·) is the pose measurement network. It should be noted that C_{measure,t} is not a 6-degree-of-freedom pose vector but only an encoded vector of the pose information of the image pair (I_{t-1}, I_t);
the pose measurement vector C_{measure,t} and the pose prediction vector C_{pred,t} are input into the pose weighted fusion network to obtain the pose weighted fusion vector C_{fuse,t} at time t:
C_{fuse,t} = (1 - W_t) * C_{measure,t} + W_t * C_{pred,t}
where W_t = Weight(C_{measure,t}, C_{pred,t}) is a weight in [0, 1] output by the last fully connected layer of the pose weighted fusion network, Weight being the 4 fully connected layers of the pose weighted fusion network; C_{pred,t} is the pose prediction vector for time t output by the pose prediction network when the adjacent frame images I_{t-2} and I_{t-1} were input to the pose estimation network, C_{pred,t} = Predict(C_{fuse,t-1}), where C_{fuse,t-1} is the pose weighted fusion vector at time t-1 and Predict(·) is the pose prediction network;
the pose weighted fusion vector C_{fuse,t} is input into the pose update network to estimate the final pose transformation T_{t→t-1}:
T_{t→t-1} = Update(C_{fuse,t})
where Update(·) is the pose update network, and T_{t→t-1} is the 6-degree-of-freedom relative pose vector from I_{t-1} to I_t.
As shown in FIG. 3, the input of KF-PoseNet is two adjacent frames of images, the output is a 6-DOF relative pose vector, the first three elements of which represent 3-DOF relative rotation R, and the last three elements of which represent 3-DOF relative displacement t.
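Purely as a hedged illustration of the data flow just described, one recursive KF-PoseNet step might look like the sketch below; the function names are illustrative, and the initialisation of the first prediction vector is an assumption because the text does not specify it.

```python
def kf_pose_step(measure_net, fuse_net, update_net, predict_net,
                 img_prev, img_curr, c_pred):
    """One recursive step: measure, fuse with the prediction, update, predict next.

    c_pred is the prediction produced at the previous step (e.g. zeros at the first
    step; that initialisation is an assumption, not specified in the text).
    """
    c_measure = measure_net(img_prev, img_curr)   # C_measure,t
    c_fuse = fuse_net(c_measure, c_pred)          # C_fuse,t
    pose_6dof = update_net(c_fuse)                # [rx, ry, rz, tx, ty, tz] = T_{t -> t-1}
    c_pred_next = predict_net(c_fuse)             # C_pred for the next image pair
    return pose_6dof, c_pred_next
```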
In this embodiment, both the pose estimation network and the depth estimation network adopt encoder-decoder structures: the encoder of the pose estimation network is the ResNet50 structure inside the pose measurement network, and its decoder consists of the remaining part of the pose measurement network (the layers other than ResNet50) together with the pose weighted fusion network, the pose prediction network, and the pose update network.
In this embodiment, the depth estimation network (DepthNet) also uses the ResNet50 structure as its encoder, uses a multilayer deconvolution structure similar to the DispNet decoder as its decoder connected to the encoder through skip links, and applies a Sigmoid activation to the output layer. The input of DepthNet is a single frame image and the output is a normalized disparity D. To obtain the depth d, the reciprocal of an affine function of the disparity is taken, d = 1/(aD + b), where a and b are parameters that limit the output range so that the output depth lies between 0.1 and 100.
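The disparity-to-depth conversion described above might be implemented as in the sketch below; the way a and b are derived from the 0.1 to 100 depth range follows common practice and is an assumption rather than a formula quoted from the text.

```python
def disparity_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Map the Sigmoid output D in [0, 1] to a depth d = 1 / (a*D + b)."""
    min_disp, max_disp = 1.0 / max_depth, 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp   # a*D + b
    return 1.0 / scaled_disp                                # depth in [min_depth, max_depth]
```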
In this embodiment, in order to control the memory usage and keep the details as much as possible, the input RGB images of the pose estimation network and the depth estimation network are scaled to 832 × 256.
In this embodiment, a pair of adjacent frame images consists of the image I_t at the current time t and the image I_{t-1} at the previous time t-1. The adjacent frame images I_t and I_{t-1} are input into the pose estimation network and the depth estimation network to obtain the pose transformation T_{t→t-1} between the adjacent frame images and the depth image D_t of each input frame.
S102, calculating a motion-weighted photometric error loss function of the video image sequence according to the pose transformations between the output adjacent frame images and the depth images of the input frames; this specifically includes:
A1, chaining (cumulatively multiplying) the pose transformations between each pair of adjacent frame images output by the pose estimation network to obtain pose transformations over longer time spans, and computing motion-weighted photometric errors between images based on these longer-span pose transformations;
In this embodiment, there may be fast-moving objects in a scene. Such objects are not consistent with the Euclidean transformation of the camera, so it is clearly unreasonable to treat their pixels in the same way as others when training the network. When the motion amplitude in the dataset is small and the illumination changes little, the brightness of a pixel at the same position does not change much between two adjacent frames. Based on this, and in order to reduce the influence of fast-moving objects, the invention designs a motion-weighted photometric error. To make the network consider the consistency of pose transformations over a long time span, this embodiment uses multiple consecutive frames and computes photometric errors constrained by long-span poses. Specifically:
for a video image sequence of length N whose frames correspond to times t_0, t_1, ..., t_{N-1}, the poses between each pair of adjacent frame images output by the pose estimation network are cumulatively multiplied to obtain the pose transformation over a longer time span:
T_{t_j→t_i} = T_{t_{i+1}→t_i} · T_{t_{i+2}→t_{i+1}} · ... · T_{t_j→t_{j-1}}
where T_{t_j→t_i} is the pose transformation from time t_j to time t_i between the images, and N is the length of each batch of video image sequences input to the pose estimation network and the depth estimation network;
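A minimal sketch of this cumulative multiplication, assuming each adjacent 6-degree-of-freedom output has already been converted to a 4x4 homogeneous transform; the helper name is illustrative.

```python
import torch

def accumulate_poses(adjacent_poses):
    """adjacent_poses[k] is T_{t_{k+1} -> t_k} as a (4, 4) tensor.
    Returns a dict mapping (j, i) with i <= j to T_{t_j -> t_i}."""
    N = len(adjacent_poses) + 1
    T = {(i, i): torch.eye(4) for i in range(N)}
    for i in range(N):
        for j in range(i + 1, N):
            # T_{t_j -> t_i} = T_{t_{j-1} -> t_i} @ T_{t_j -> t_{j-1}}
            T[(j, i)] = T[(j - 1, i)] @ adjacent_poses[j - 1]
    return T
```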
then, for the image
I_{t_j}, the three-dimensional coordinates of a point p_{t_j} can be recovered from its depth image D_{t_j}; the corresponding projected point p_{t_i} on the image I_{t_i} can then be calculated by the following formula:
p_{t_i} ≈ K · T_{t_j→t_i} · D_{t_j}(p_{t_j}) · K^{-1} · p_{t_j}
where K is the camera intrinsic matrix and D_{t_j} is the depth image at time t_j; the formula above omits part of the homogeneous-coordinate computation.
By sampling the image I_{t_i} at the projected coordinates, the reconstructed image Î_{t_j} of the time-t_j image I_{t_j} is obtained.
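This projection-and-sampling step is a standard inverse warp; a hedged PyTorch sketch is given below (device and dtype handling omitted). It mirrors the equation above up to the homogeneous-coordinate details that the text itself omits.

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_ti, depth_tj, T_tj_to_ti, K, K_inv):
    """Reconstruct the frame at time t_j by sampling the frame at time t_i.

    img_ti:     (B, 3, H, W) source image I_{t_i}
    depth_tj:   (B, 1, H, W) predicted depth D_{t_j}
    T_tj_to_ti: (B, 4, 4) relative pose from t_j to t_i
    K, K_inv:   (B, 3, 3) camera intrinsics and their inverse
    """
    B, _, H, W = img_ti.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1).float()
    pix = pix.unsqueeze(0).expand(B, -1, -1)                    # (B, 3, H*W)

    cam = (K_inv @ pix) * depth_tj.reshape(B, 1, -1)            # back-project with D_{t_j}
    cam = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)      # homogeneous coordinates
    proj = K @ (T_tj_to_ti @ cam)[:, :3, :]                     # project into frame t_i
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)

    # normalise pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1], dim=-1)
    return F.grid_sample(img_ti, grid.view(B, H, W, 2),
                         padding_mode="border", align_corners=True)
```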
Then, for
a pixel p of the reconstructed image Î_{t_j}, its motion weighting term W_mw can be computed (the explicit formula is provided as an image in the original publication).
Finally, using the obtained motion weighting term W_mw, the motion-weighted photometric error between the images I_{t_j} and Î_{t_j} is computed:
E^{mw}_{t_j→t_i} = W_mw ⊙ [ α_0 · (1 − SSIM(I_{t_j}, Î_{t_j})) + α_1 · ‖I_{t_j} − Î_{t_j}‖_1 + α_2 · ‖I_{t_j} − Î_{t_j}‖_2 ]
where SSIM(I_{t_j}, Î_{t_j}) is the structural similarity between the original image I_{t_j} and the reconstructed image Î_{t_j}, α_0, α_1 and α_2 are hyper-parameters controlling the proportion of each term, the symbol ⊙ denotes the pixel-wise product, ‖·‖_1 is the 1-norm, and ‖·‖_2 is the 2-norm.
In this embodiment, the motion weighting term W_mw weights the computed photometric error pixel by pixel, yielding the motion-weighted photometric error.
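Because the explicit W_mw formula is only available as an image in the source, the sketch below simply takes the motion-weight map as an input and illustrates how the SSIM, 1-norm and 2-norm terms could be combined and weighted pixel by pixel; the (1 - SSIM) form and the 3x3 average-pooled SSIM helper follow common practice and are assumptions.

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM computed with 3x3 average pooling (a common simplification)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def motion_weighted_photometric_error(img_tj, img_tj_rec, w_mw,
                                      alpha0=0.85, alpha1=0.1, alpha2=0.05):
    """Combine the SSIM, 1-norm and 2-norm terms and weight them by the map W_mw."""
    ssim_term = alpha0 * (1.0 - ssim_map(img_tj, img_tj_rec).mean(1, keepdim=True))
    l1_term = alpha1 * (img_tj - img_tj_rec).abs().mean(1, keepdim=True)
    l2_term = alpha2 * (img_tj - img_tj_rec).pow(2).sum(1, keepdim=True).sqrt()
    return (w_mw * (ssim_term + l1_term + l2_term)).mean()
```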
Further, when an object that is stationary relative to the camera is present in the field of view, the accuracy of depth estimation may suffer and the estimated depth may tend toward infinity. For this reason, this embodiment also automatically marks such static pixels and removes them from the training process. Specifically, pixels whose raw error between the current image and the reference image is smaller than their reconstruction error are regarded as stationary relative to the camera, and the depth network is trained using only pixels whose reconstruction error is smaller than the error between the current image and the reference image (i.e., the pixels involved in the photometric error calculation).
In this embodiment, the pixels involved in the photometric error calculation are determined and marked as mask:
mask = [ ‖Î_{t_j} − I_{t_j}‖_* < ‖I_{t_i} − I_{t_j}‖_* ]
where I_{t_i} is the original image at time t_i, I_{t_j} is the original image at time t_j, Î_{t_j} is the reconstructed image of the time-t_j image obtained by sampling the original image I_{t_i} at time t_i, ‖·‖_* denotes a photometric error, i.e., a 1-norm or a 2-norm, and [·] equals 1 where the condition holds and 0 otherwise.
When computing the motion-weighted photometric error between the images I_{t_j} and Î_{t_j}, only the pixels marked by mask are used, and only these pixels are then used for network training.
A2, calculating the motion-weighted photometric error loss function L_p of the video image sequence from the computed photometric errors:
L_p = Σ_{0 ≤ i < j ≤ N−1} E^{mw}_{t_j→t_i}
where E^{mw}_{t_j→t_i} denotes the motion-weighted photometric error.
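A minimal sketch of the automatic masking rule defined above, assuming a per-pixel 1-norm over the colour channels (the text allows either a 1-norm or a 2-norm); the resulting binary map gates which pixels enter the photometric loss.

```python
def auto_mask(img_ti, img_tj, img_tj_rec):
    """1 where the reconstruction error is smaller than the raw frame-to-frame error."""
    err_rec = (img_tj_rec - img_tj).abs().mean(dim=1, keepdim=True)  # reconstruction error
    err_raw = (img_ti - img_tj).abs().mean(dim=1, keepdim=True)      # raw inter-frame error
    return (err_rec < err_raw).float()                               # mask
```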
S103, introducing a variational autoencoder structure into the constructed pose estimation network and depth estimation network, and calculating the variational autoencoder loss function;
In this embodiment, KF-PoseNet and DepthNet both use encoder-decoder structures. To make the decoder output more robust to noise in its input coding and to improve the generalization ability of the networks, a Variational Auto-Encoder (VAE) structure is introduced into KF-PoseNet and DepthNet.
Take the depth estimation network as an example:
the encoder of the depth estimation network maps the input image x_d = I_t into the coding space to obtain the mean vector E_d(x_d);
further, let q_d(c_d|x_d) be the distribution of the code c_d to be input to the decoder, set as the Gaussian distribution N(E_d(x_d), Σ_d), whose mean is the mean E_d of the input image and whose covariance is the covariance Σ_d of the input image;
the code c_d is obtained by random sampling from the distribution q_d(c_d|x_d), where c_d obeys q_d(c_d|x_d), written c_d ~ q_d(c_d|x_d);
further, the code c_d is input to the decoder to obtain the depth image of the input image;
to satisfy the back-propagation requirement of a deep network, in this embodiment the random sampling of the code in the coding space is turned into a differentiable operation by the following reparameterization: let η be a random vector obeying the zero-mean, unit-covariance Gaussian distribution N(0, I), where I is the identity matrix; then sampling c_d ~ q_d(c_d|x_d) can be realized by c_d = E_d(x_d) + Σ_d · η, where Σ_d is the covariance of the input image;
the pose estimation network is handled in the same way;
further, the VAE loss function L_VAE is calculated as:
L_VAE = λ_1 · KL( q_d(c_d|x_d) ‖ p_η(c) ) + λ_2 · KL( q_p(c_p|x_p) ‖ p_η(c) ) − E_{c_d∼q_d(c_d|x_d), c_p∼q_p(c_p|x_p)}[ log p(x̂ | c_d, c_p) ]
where x_d and x_p denote input images and the hyper-parameters λ_1 and λ_2 control the weight of each term; p_η(c) is the prior distribution and c is its argument; q_d(c_d|x_d) is the sampling distribution of the depth estimation network's code c_d, and q_p(c_p|x_p) is the sampling distribution of the pose estimation network's code c_p; KL(·‖·) is the KL divergence, KL(q_d(c_d|x_d)‖p_η(c)) being the KL divergence of q_d(c_d|x_d) from p_η(c) and KL(q_p(c_p|x_p)‖p_η(c)) being the KL divergence of q_p(c_p|x_p) from p_η(c); p(x̂|c_d, c_p) is the probability distribution over the reconstructed image x̂ generated from the outputs obtained by feeding c_d and c_p into the decoders of the depth estimation network and the pose estimation network, respectively; E denotes the mathematical expectation, and E_{c_d∼q_d(c_d|x_d), c_p∼q_p(c_p|x_p)}[log p(x̂|c_d, c_p)] is the expectation of log p(x̂|c_d, c_p) under the conditions c_d ∼ q_d(c_d|x_d) and c_p ∼ q_p(c_p|x_p). The first two terms are KL divergences that penalize the latent-code distributions for deviating from the prior; the last term, a non-negative log-likelihood to be minimized, is equivalent to minimizing the photometric error loss function, so in practice only the first two terms of the formula are used as the VAE loss.
In this embodiment, the prior distribution p_η(c) is the zero-mean Gaussian distribution N(0, I).
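A hedged sketch of the reparameterised sampling and the two KL terms kept as the VAE loss, assuming diagonal-Gaussian posteriors parameterised by a mean and a log-variance; this is a common parameterisation and differs slightly from the covariance-of-the-input-image formulation used in the text.

```python
import torch

def reparameterise(mu, logvar):
    """c = mu + sigma * eta with eta ~ N(0, I); keeps the sampling differentiable."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def vae_regulariser(mu_d, logvar_d, mu_p, logvar_p, lambda1=0.01, lambda2=0.01):
    """KL divergences of the depth and pose codes against the N(0, I) prior.
    The reconstruction term is covered by the photometric loss, as noted above."""
    kl_d = -0.5 * torch.sum(1 + logvar_d - mu_d.pow(2) - logvar_d.exp(), dim=-1).mean()
    kl_p = -0.5 * torch.sum(1 + logvar_p - mu_p.pow(2) - logvar_p.exp(), dim=-1).mean()
    return lambda1 * kl_d + lambda2 * kl_p
```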
S104, training the pose estimation network and the depth estimation network with a training strategy for the missing-frame case, based on the obtained photometric error loss function and variational autoencoder loss function; this specifically includes:
First, for a plane with stable texture in three-dimensional space, its depth in the depth image tends not to vary drastically. Therefore, in this embodiment, a depth smoothing loss function L_s is also computed for the output of the depth estimation network:
L_s = Σ_p ( |∂_x d_t(p)| · e^{−|∂_x I_t(p)|} + |∂_y d_t(p)| · e^{−|∂_y I_t(p)|} )
where d_t is the disparity, which is inversely proportional to the depth image D_t, ∂_x and ∂_y denote the partial derivatives in the x- and y-directions, respectively, and I_t is the image at time t;
in this embodiment, the depth smoothing loss function is calculated for each frame of image in each batch;
then, based on the obtained depth smoothing loss function, photometric error loss function and variational automatic encoder loss function, determining a final loss function L:
L = L_p + λ·L_s + L_VAE
where λ is a hyper-parameter controlling the weight of the depth smoothing loss function, L_p is the photometric error loss function, L_s is the depth smoothing loss function, and L_VAE is the variational autoencoder loss function;
Finally, the obtained final loss function is used to train the pose estimation network and the depth estimation network with the training strategy for the missing-frame case.
S105, estimating the camera pose corresponding to each frame of image in the video image sequence whose poses are to be estimated, using the trained pose estimation network.
In this embodiment, the Kalman-filtering-based pose estimation network (KF-PoseNet) borrows the idea of Kalman filtering in its design and links successive estimates over time, so that KF-PoseNet can better adapt to the missing-frame case.
in the embodiment, during training, all images in a batch of video image sequences are input into the pose estimation network and the depth estimation network, and the pose estimation network and the depth estimation network are trained; further, aiming at the possible frame missing condition existing in the visual odometer, all images in a batch of video image sequences are input into the depth estimation network, one or more frames of images in the batch of video image sequences are input into the pose estimation network after being set to zero, and the pose estimation network and the depth estimation network are trained. For example, when N is 5, a batch simultaneously inputs 5 consecutive frames of images to the depth estimation network, and respectively inputs every two adjacent frames to the pose estimation network; further, aiming at the possible frame missing condition existing in the visual odometer, two frames of images are randomly set to zero from the last 3 frames of the five continuous frames input at one time, and then the images are input into the pose estimation network for training, while the input of the depth estimation network is still a complete image.
And after the training is finished, estimating the camera pose corresponding to each frame of image in the video image sequence of the pose to be estimated by using the trained pose estimation network.
The monocular visual odometry method based on the Kalman pose estimation network can effectively estimate the camera pose corresponding to each frame from the input image sequence and can adapt to missing frames. The invention is suitable for self-supervised monocular visual odometry.
The monocular visual odometry method based on the Kalman pose estimation network provided by the embodiments of the invention has at least the following advantages:
(1) Many existing self-supervised methods do not consider the association between frames and make insufficient use of inter-frame information, so the trained network struggles to estimate accurate poses and cannot adapt to missing frames. To address this, this embodiment constructs a pose estimation network based on Kalman filtering and designs a training strategy for the missing-frame case on top of it, so that the pose estimation network can exploit inter-frame information when estimating the current pose and is better suited to the missing-frame case;
(2) A moving object that may exist in the scene does not follow the same Euclidean transformation as the scene, violating the static-scene assumption, so the scene motion is difficult to describe with a single Euclidean transformation and the estimation result of the pose estimation network is biased. To address this, this embodiment designs a motion-weighted photometric error that reduces the influence of such moving objects during training, improving the accuracy of pose estimation.
In order to verify the effectiveness of the monocular visual odometry method based on the Kalman pose estimation network provided by the embodiment of the invention, the performance of the method is tested with the evaluation indices provided by the KITTI odometry dataset:
(1) Relative displacement root-mean-square error (rel. trans.): the average translation RMSE (Root Mean Square Error) over all subsequences of length 100, 200, ..., 800 meters in a sequence, measured in % (i.e., meters of deviation per 100 meters travelled); the smaller the value, the better.
(2) Relative rotation root-mean-square error (rel. rot.): the average rotation RMSE over all subsequences of length 100, 200, ..., 800 meters in a sequence, measured in deg/m; the smaller the value, the better.
In this embodiment, sequences 00 to 07 of the KITTI odometry dataset (eight sequences) are used as the training and validation sets to train the pose estimation network and the depth estimation network, and sequences 09 and 10 are used to test the performance of the Kalman-filtering-based pose estimation network for self-supervised monocular visual odometry.
The KITTI odometry dataset consists of stereo images, laser-radar (lidar) points, and ground-truth trajectories of urban road environments, collected with vehicle-mounted cameras and other equipment.
In the implementation, a depth estimation network and a Kalman-filtering-based pose estimation network are constructed, wherein the pose estimation network outputs the pose transformation between each pair of adjacent input frame images and the depth estimation network outputs the depth image of each input frame; a motion-weighted photometric error loss function of the video image sequence is calculated from the output pose transformations between adjacent frame images and the depth images of the input frames; a variational autoencoder structure is introduced into the constructed pose estimation network and depth estimation network, and the variational autoencoder loss function is calculated; based on the obtained photometric error loss function and variational autoencoder loss function, the pose estimation network and the depth estimation network are trained with the training strategy for the missing-frame case; and the trained pose estimation network is used to estimate the camera pose corresponding to each frame of the video image sequence whose poses are to be estimated.
In this embodiment, the hyper-parameters of the photometric error loss function are α_0 = 0.85, α_1 = 0.1 and α_2 = 0.05, the weight of the depth smoothing loss function is λ = 10^-3, and the VAE loss parameters are λ_1 = λ_2 = 0.01. During training, the initial learning rate is 10^-4 and decays gradually, being multiplied by 0.97 after each round of iteration; 45 rounds are performed with the Adam optimizer, the batch size of each round is 2, and each batch contains 3 consecutive frames.
In order to verify the performance of the method of the present invention, recent self-supervised deep-learning-based monocular visual odometry methods were selected for comparison, and the experimental results are shown in Table 2. The trajectories generated by this embodiment are shown in FIG. 4, where the dashed curve is the ground-truth trajectory and the solid curve is the trajectory estimated by this embodiment.
As can be seen from table 2, the method described in this embodiment achieves better performance compared to other methods due to better utilization of information extracted from past time instants, weighting of motion pixels, and application of VAE structures.
TABLE 2 comparison of the method of this example with other methods
(Table 2 is provided as an image in the original publication and is not reproduced here.)
In order to verify the significance of each part of the method described in this embodiment, ablation experiments were also performed. The experimental results are shown in Table 3, where "without Kalman structure" in the second row indicates that the Kalman structure is removed from the network: the decoder of the pose estimation network becomes four convolutional layers, the activation function of the first three convolutional layers is ReLU, and the output of the fourth layer is passed through global average pooling to obtain a 6-degree-of-freedom pose vector. The third to fifth rows correspond to the results of removing motion weighting, the VAE structure, and the long-term consistency constraint from the network, respectively. "# fc = 6" and "# fc = 2" in the sixth and seventh rows are the results of the pose estimation network decoder using fully connected layers with different numbers of layers. The first row, "basic", is the result without any of the above structures added. The last row is the result of the complete method.
The experimental results show that the Kalman-like structure lets the network draw on information from previous data when estimating the pose of the current adjacent frames, making the current estimate more accurate; with motion weighting, the network pays more attention to pixels of static objects during training and weakens the interference of objects inconsistent with the camera's Euclidean transformation; and with the VAE structure, the decoder becomes more robust to noise in the encoder output, improving the generalization ability of the network and further improving the results. The complete method therefore achieves the best experimental results. The performance improves as each component is added, which demonstrates the significance of every part of the method.
TABLE 3 ablation test results
(Table 3 is provided as an image in the original publication and is not reproduced here.)
Table 4 experimental results for the case of frame missing
(Table 4 is provided as an image in the original publication and is not reproduced here.)
This embodiment also performs an ablation experiment on the training strategy designed for the missing-frame case. During testing, one frame is set to zero at frames 50, 150, ..., and two frames are set to zero at frames 100, 200, ..., so as to test the invention under missing frames. The test results are shown in Table 4. The first row, "without frame training", is the result of training without the missing-frame training strategy of this embodiment; the second row, "without Kalman structure", is the result for the network without the Kalman structure, trained without the missing-frame training strategy; and the third row is the result of training with the missing-frame training strategy of this embodiment. As can be seen from Table 4, the method proposed in this embodiment adapts well to the missing-frame case.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A monocular visual odometry method based on a Kalman pose estimation network, characterized by comprising the following steps:
constructing a depth estimation network and a Kalman-filtering-based pose estimation network, wherein the pose estimation network is used for outputting the pose transformation between each pair of adjacent input frame images and the depth estimation network is used for outputting the depth image of each input frame;
calculating a photometric error loss function of a video image sequence based on motion weighting according to the output pose transformation between each pair of adjacent frame images and the depth image of the input frame;
introducing a variational autoencoder structure into the constructed pose estimation network and depth estimation network, and calculating a variational autoencoder loss function;
based on the obtained photometric error loss function and variational autoencoder loss function, training the pose estimation network and the depth estimation network with a training strategy for the missing-frame case;
and estimating the camera pose corresponding to each frame of image in the video image sequence whose poses are to be estimated, using the trained pose estimation network.
2. The monocular visual odometry method based on a Kalman pose estimation network of claim 1, wherein the pose estimation network comprises a pose measurement network, a pose weighted fusion network, a pose update network, and a pose prediction network; wherein,
the pose measurement network encodes the input adjacent frame images I_{t-1} and I_t to obtain the pose measurement vector C_{measure,t} at time t:
C_{measure,t} = Measure(I_{t-1}, I_t)
where I_{t-1} and I_t are the images at time t-1 and time t, respectively, and Measure(·) is the pose measurement network;
the pose measurement vector C_{measure,t} and the pose prediction vector C_{pred,t} are input into the pose weighted fusion network to obtain the pose weighted fusion vector C_{fuse,t} at time t:
C_{fuse,t} = (1 - W_t) * C_{measure,t} + W_t * C_{pred,t}
where W_t is a weight in [0, 1] output by the last fully connected layer of the pose weighted fusion network; C_{pred,t} is the pose prediction vector for time t output by the pose prediction network when the adjacent frame images I_{t-2} and I_{t-1} were input to the pose estimation network, C_{pred,t} = Predict(C_{fuse,t-1}), where C_{fuse,t-1} is the pose weighted fusion vector at time t-1 and Predict is the pose prediction network;
the pose weighted fusion vector C_{fuse,t} is input into the pose update network to estimate the pose transformation T_{t→t-1}:
T_{t→t-1} = Update(C_{fuse,t})
where Update(·) is the pose update network, and T_{t→t-1} is the 6-degree-of-freedom relative pose vector from I_{t-1} to I_t, comprising the relative rotation and the relative displacement.
3. The monocular visual odometry method based on a Kalman pose estimation network of claim 2, wherein the pose estimation network and the depth estimation network both employ an encoder-decoder structure.
4. The Kalman pose estimation network based monocular visual odometry method of claim 1, wherein the computing a motion-weighted based photometric error loss function for a video image sequence based on the pose transformation between each pair of output adjacent frame images and the depth image of the input frame comprises:
multiplying the pose transformation between each pair of adjacent frame images output by the pose estimation network to obtain the pose transformation in a longer time period, and calculating the photometric error between the images based on the motion weighting based on the obtained pose transformation in the longer time period;
and calculating a photometric error loss function based on motion weighting of the video image sequence according to the calculated photometric error.
5. The monocular vision odometry method based on the kalman pose estimation network of claim 4, wherein the multiplying the pose transformation between each pair of adjacent frame images output by the pose estimation network results in a pose transformation of a longer period of time, and the calculating the photometric error between the images based on the motion weighting based on the resulting pose transformation of a longer period of time comprises:
for a video image sequence with length N, the corresponding time is t0,t1,...,tN-1Accumulating and multiplying the poses between each pair of adjacent frame images output by the pose estimation network to obtain pose transformation in a longer period of time
Figure FDA0003561639660000021
Wherein, the first and the second end of the pipe are connected with each other,
Figure FDA0003561639660000022
is from time tjTo time tiThe pose between the images is changed; n is the length of each batch of video image sequences of the input pose estimation network and the depth estimation network;
for images
Figure FDA0003561639660000023
A point of
Figure FDA0003561639660000024
Its three-dimensional coordinates are represented by its depth image
Figure FDA0003561639660000025
Reduction; in the image
Figure FDA0003561639660000026
Upper corresponding projected point
Figure FDA0003561639660000027
Expressed as:
Figure FDA0003561639660000028
wherein K is a camera intrinsic parameter;
Figure FDA0003561639660000029
is tjA depth image of a time;
by aligning images
Figure FDA00035616396600000210
Sampling to obtain tjTime of day image
Figure FDA00035616396600000211
Is reconstructed image of
Figure FDA00035616396600000212
Figure FDA00035616396600000213
for a pixel p of the reconstructed image Î_{t_j}, its motion weighting term W_mw is calculated; using the obtained motion weighting term W_mw, the motion-weighted photometric error between the images I_{t_j} and Î_{t_j} is computed:
pe(I_{t_j}, Î_{t_j}) = α_0·(1 − SSIM(I_{t_j}, Î_{t_j})) + α_1·||W_mw ⊙ (I_{t_j} − Î_{t_j})||_1 + α_2·||W_mw ⊙ (I_{t_j} − Î_{t_j})||_2
wherein pe(I_{t_j}, Î_{t_j}) denotes the motion-weighted photometric error between the images I_{t_j} and Î_{t_j}; SSIM(I_{t_j}, Î_{t_j}) denotes the structural similarity between the original image I_{t_j} and the reconstructed image Î_{t_j}; α_0, α_1 and α_2 are hyper-parameters controlling the proportion of each part; the symbol ⊙ denotes the pixel-wise product; ||·||_1 denotes the 1-norm; and ||·||_2 denotes the 2-norm.
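One possible reading of this error in code, with a compact SSIM approximation; the exact weighting of the SSIM term and the default values of α_0, α_1, α_2 are assumptions.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Compact SSIM map using 3x3 average pooling (a common approximation)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1)

def motion_weighted_photometric_error(img_j, recon_j, w_mw, a0=0.85, a1=0.1, a2=0.05):
    """Per-pixel motion-weighted photometric error: SSIM term plus weighted 1-norm and 2-norm terms."""
    diff = w_mw * (img_j - recon_j)                      # pixel-wise product with the motion weighting term
    l_ssim = (1.0 - ssim(img_j, recon_j)).mean(dim=1, keepdim=True)
    l1 = diff.abs().mean(dim=1, keepdim=True)            # 1-norm term
    l2 = (diff ** 2).mean(dim=1, keepdim=True).sqrt()    # 2-norm term
    return a0 * l_ssim + a1 * l1 + a2 * l2                # per-pixel error map
```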
6. The Kalman pose estimation network based monocular visual odometry method of claim 5, characterized in that before the obtained motion weighting term W_mw is used to calculate the motion-weighted photometric error between the images I_{t_j} and Î_{t_j}, the method further comprises:
determining the pixels involved in the photometric error calculation and marking them with a mask:
mask = [ ||I_{t_j} − Î_{t_j}||_* < ||I_{t_j} − I_{t_i}||_* ]
wherein I_{t_i} is the original image at time t_i; I_{t_j} is the original image at time t_j; Î_{t_j} is the reconstruction of the t_j-time image I_{t_j} obtained by sampling the original image I_{t_i} at time t_i; ||·||_* denotes the photometric error, i.e. a 1-norm or a 2-norm; and [·] is 1 where the condition holds and 0 otherwise;
when calculating the motion-weighted photometric error between the images I_{t_j} and Î_{t_j}, only the pixels marked by the mask are used.
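A sketch of such a mask; comparing the reconstruction error against the error of the un-warped source frame follows common practice in self-supervised depth estimation and is assumed here.

```python
import torch

def photometric_mask(img_j, img_i, recon_j):
    """Mark pixels where the reconstruction explains I_{t_j} better than the raw source frame I_{t_i}.
    All tensors: (B, 3, H, W); returns a float mask of shape (B, 1, H, W)."""
    err_recon = (img_j - recon_j).abs().mean(dim=1, keepdim=True)   # photometric error vs. reconstruction
    err_source = (img_j - img_i).abs().mean(dim=1, keepdim=True)    # photometric error vs. original source
    return (err_recon < err_source).float()                         # only these pixels enter the loss
```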
7. The Kalman pose estimation network based monocular visual odometry method of claim 5, wherein the photometric error loss function is obtained by accumulating the motion-weighted photometric errors pe(I_{t_j}, Î_{t_j}) over all image pairs of the video image sequence, and is denoted L_p.
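One way such a sequence-level loss could be assembled; averaging over mask-marked pixels of all image pairs is an assumption, since the patent gives the formula only as an image.

```python
import torch

def photometric_loss(error_maps, masks):
    """error_maps, masks: lists of (B,1,H,W) tensors, one per image pair in the sequence.
    Averages the motion-weighted photometric error over the mask-marked pixels of every pair."""
    total, count = 0.0, 0.0
    for err, m in zip(error_maps, masks):
        total = total + (err * m).sum()
        count = count + m.sum().clamp(min=1.0)   # guard so the final division cannot be by zero
    return total / count
```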
8. The Kalman pose estimation network based monocular visual odometry method of claim 1, characterized in that the variational autoencoder loss function is expressed as:
L_VAE = −E_{c_d∼q_d(c_d|x_d), c_p∼q_p(c_p|x_p)}[ log p(x̂ | c_d, c_p) ] + λ_1·KL(q_d(c_d|x_d) || p_η(c)) + λ_2·KL(q_p(c_p|x_p) || p_η(c))
wherein L_VAE denotes the variational autoencoder loss function; x_d and x_p both denote input images; λ_1 and λ_2 are hyper-parameters; p_η(c) is the prior distribution and c is its variable; q_d(c_d|x_d) is the sampling distribution of the encoding c_d of the depth estimation network; q_p(c_p|x_p) is the sampling distribution of the encoding c_p of the pose estimation network; KL denotes the KL divergence, KL(q_d(c_d|x_d)||p_η(c)) being the KL divergence of q_d(c_d|x_d) with respect to p_η(c) and KL(q_p(c_p|x_p)||p_η(c)) being the KL divergence of q_p(c_p|x_p) with respect to p_η(c); p(x̂|c_d, c_p) is the probability distribution of the reconstructed image x̂ generated by inputting c_d and c_p into the decoders of the depth estimation network and the pose estimation network, respectively; E denotes the mathematical expectation, c_d∼q_d(c_d|x_d) means that c_d obeys the distribution q_d(c_d|x_d), c_p∼q_p(c_p|x_p) means that c_p obeys the distribution q_p(c_p|x_p), and E_{c_d∼q_d(c_d|x_d), c_p∼q_p(c_p|x_p)}[log p(x̂|c_d, c_p)] is the mathematical expectation of log p(x̂|c_d, c_p) under the conditions c_d∼q_d(c_d|x_d) and c_p∼q_p(c_p|x_p).
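A minimal sketch of a loss of this shape, assuming Gaussian encodings, a standard normal prior p_η(c), and an L2 reconstruction term standing in for the log-likelihood; the patent gives the exact expression only as an image, so every concrete choice below is an assumption.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) per sample."""
    return 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1)

def vae_loss(x_recon, x_target, mu_d, logvar_d, mu_p, logvar_p, lam1=1e-3, lam2=1e-3):
    """Reconstruction term plus KL terms for the depth and pose encodings (lam1, lam2 play the role of the hyper-parameters)."""
    recon = ((x_recon - x_target) ** 2).mean()                  # stands in for -E[log p(x_hat | c_d, c_p)]
    kl_d = kl_to_standard_normal(mu_d, logvar_d).mean()         # KL(q_d(c_d|x_d) || p_eta(c))
    kl_p = kl_to_standard_normal(mu_p, logvar_p).mean()         # KL(q_p(c_p|x_p) || p_eta(c))
    return recon + lam1 * kl_d + lam2 * kl_p
```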
9. The monocular visual odometry method based on a Kalman pose estimation network of claim 1, wherein training the pose estimation network and the depth estimation network with the training strategy for the frame missing condition, based on the obtained photometric error loss function and variational autoencoder loss function, comprises:
for the output of the depth estimation network, a depth smoothing loss function is computed:
L_s = |∂_x d_t|·e^{−|∂_x I_t|} + |∂_y d_t|·e^{−|∂_y I_t|}
wherein d_t is the disparity, which is inversely proportional to the depth image D_t; ∂_x and ∂_y denote the partial derivatives in the x and y directions, respectively; and I_t is the image at time t;
determining a final loss function L based on the obtained depth smoothing loss function, photometric error loss function and variational autoencoder loss function:
L = L_p + λ·L_s + L_VAE
wherein λ is a hyper-parameter controlling the proportion of the depth smoothing loss function, L_p denotes the photometric error loss function, L_s denotes the depth smoothing loss function, and L_VAE denotes the variational autoencoder loss function;
and training the pose estimation network and the depth estimation network with the obtained final loss function, using the training strategy for the frame missing condition.
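A sketch of the depth smoothing term and the final loss assembly L = L_p + λ·L_s + L_VAE; the edge-aware exponential weighting is the usual formulation and an assumption here.

```python
import torch

def depth_smoothness_loss(disp, img):
    """Edge-aware smoothness on the disparity map disp (B,1,H,W), guided by image gradients of img (B,3,H,W)."""
    ddx = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    ddy = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    idx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    idy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (ddx * torch.exp(-idx)).mean() + (ddy * torch.exp(-idy)).mean()

def total_loss(loss_photo, loss_smooth, loss_vae, lam=1e-3):
    # L = L_p + lambda * L_s + L_VAE
    return loss_photo + lam * loss_smooth + loss_vae
```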
10. The Kalman pose estimation network based monocular visual odometry method of claim 1, wherein training the pose estimation network and the depth estimation network with the training strategy for the frame missing condition comprises:
inputting all images of a batch of video image sequences into both the pose estimation network and the depth estimation network, and training the two networks;
and inputting all images of a batch of video image sequences into the depth estimation network, while setting one or more frames of the batch to zero before inputting the batch into the pose estimation network, and training the two networks.
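A sketch of this two-pass training strategy; the network interfaces, the loss bundle compute_loss, and the choice of which frame to zero are placeholders, not the patented procedure.

```python
import random
import torch

def train_step(batch_frames, pose_net, depth_net, optimizer, compute_loss):
    """batch_frames: (B, N, 3, H, W) video clips. compute_loss bundles the photometric, smoothness and VAE terms."""
    # Pass 1: all frames go to both networks.
    loss_full = compute_loss(pose_net(batch_frames), depth_net(batch_frames), batch_frames)

    # Pass 2: the depth network still sees every frame; the pose network sees the clip with one frame zeroed.
    dropped = batch_frames.clone()
    k = random.randrange(batch_frames.shape[1])          # which frame(s) to drop is illustrative
    dropped[:, k] = 0.0
    loss_drop = compute_loss(pose_net(dropped), depth_net(batch_frames), batch_frames)

    loss = loss_full + loss_drop
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```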
CN202210290482.3A 2022-03-23 2022-03-23 Monocular vision odometer method based on Kalman pose estimation network Active CN114663496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210290482.3A CN114663496B (en) 2022-03-23 2022-03-23 Monocular vision odometer method based on Kalman pose estimation network


Publications (2)

Publication Number Publication Date
CN114663496A true CN114663496A (en) 2022-06-24
CN114663496B CN114663496B (en) 2022-10-18

Family

ID=82031748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210290482.3A Active CN114663496B (en) 2022-03-23 2022-03-23 Monocular vision odometer method based on Kalman pose estimation network

Country Status (1)

Country Link
CN (1) CN114663496B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150124882A1 (en) * 2013-11-05 2015-05-07 Arris Enterprises, Inc. Bit depth variable for high precision data in weighted prediction syntax and semantics
CN108665496A (en) * 2018-03-21 2018-10-16 浙江大学 A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method
US20200041276A1 (en) * 2018-08-03 2020-02-06 Ford Global Technologies, Llc End-To-End Deep Generative Model For Simultaneous Localization And Mapping
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network
CN110910447A (en) * 2019-10-31 2020-03-24 北京工业大学 Visual odometer method based on dynamic and static scene separation
US20220036577A1 (en) * 2020-07-30 2022-02-03 Apical Limited Estimating camera pose
CN112102399A (en) * 2020-09-11 2020-12-18 成都理工大学 Visual mileage calculation method based on generative antagonistic network
CN113108771A (en) * 2021-03-05 2021-07-13 华南理工大学 Movement pose estimation method based on closed-loop direct sparse visual odometer
CN113483762A (en) * 2021-07-05 2021-10-08 河南理工大学 Pose optimization method and device
CN114022527A (en) * 2021-10-20 2022-02-08 华中科技大学 Monocular endoscope depth and pose estimation method and device based on unsupervised learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHUNHUI ZHAO ET AL.: "Pose estimation for multi-camera systems", 2017 IEEE INTERNATIONAL CONFERENCE ON UNMANNED SYSTEMS (ICUS) *
UGUR KAYASAL: "Magnetometer Aided Inertial Navigation System: Modeling and Simulation of a Navigation System Based on an IMU and a Magnetometer", 28 February 2017 *
YAN WANG ET AL.: "Unsupervised Learning of Accurate Camera Pose and Depth From Video Sequences With Kalman Filter", IEEE ACCESS *
ZHOU KAI ET AL.: "Dense visual odometry algorithm fusing edge information in dynamic environments", JOURNAL OF HARBIN INSTITUTE OF TECHNOLOGY *
MENG QINGXIN ET AL.: "Fundamentals of Robotics", 30 September 2006 *
ZHANG WEIQI: "Research on learning-based monocular simultaneous localization and mapping", CHINA MASTER'S AND DOCTORAL DISSERTATIONS FULL-TEXT DATABASE (DOCTORAL), INFORMATION SCIENCE AND TECHNOLOGY SERIES *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131404A (en) * 2022-07-01 2022-09-30 上海人工智能创新中心 Monocular 3D detection method based on motion estimation depth
CN115131404B (en) * 2022-07-01 2024-06-14 上海人工智能创新中心 Monocular 3D detection method based on motion estimation depth
CN115841151A (en) * 2023-02-22 2023-03-24 禾多科技(北京)有限公司 Model training method and device, electronic equipment and computer readable medium
CN116612182A (en) * 2023-07-19 2023-08-18 煤炭科学研究总院有限公司 Monocular pose estimation method and monocular pose estimation device
CN116612182B (en) * 2023-07-19 2023-09-29 煤炭科学研究总院有限公司 Monocular pose estimation method and monocular pose estimation device
CN117214860A (en) * 2023-08-14 2023-12-12 北京科技大学顺德创新学院 Laser radar odometer method based on twin feature pyramid and ground segmentation
CN117214860B (en) * 2023-08-14 2024-04-19 北京科技大学顺德创新学院 Laser radar odometer method based on twin feature pyramid and ground segmentation
CN117197229A (en) * 2023-09-22 2023-12-08 北京科技大学顺德创新学院 Multi-stage estimation monocular vision odometer method based on brightness alignment
CN117197229B (en) * 2023-09-22 2024-04-19 北京科技大学顺德创新学院 Multi-stage estimation monocular vision odometer method based on brightness alignment
CN117974721A (en) * 2024-04-01 2024-05-03 合肥工业大学 Vehicle motion estimation method and system based on monocular continuous frame images

Also Published As

Publication number Publication date
CN114663496B (en) 2022-10-18

Similar Documents

Publication Publication Date Title
CN114663496B (en) Monocular vision odometer method based on Kalman pose estimation network
CN114782691B (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN107424177B (en) Positioning correction long-range tracking method based on continuous correlation filter
CN110490928A (en) A kind of camera Attitude estimation method based on deep neural network
Varma et al. Transformers in self-supervised monocular depth estimation with unknown camera intrinsics
CN114663509B (en) Self-supervision monocular vision odometer method guided by key point thermodynamic diagram
CN111145255B (en) Pose calculation method and system combining deep learning and geometric optimization
CN110610486B (en) Monocular image depth estimation method and device
CN112233179B (en) Visual odometer measuring method
CN113256698B (en) Monocular 3D reconstruction method with depth prediction
CN111325784A (en) Unsupervised pose and depth calculation method and system
CN110264526B (en) Scene depth and camera position and posture solving method based on deep learning
CN110942484B (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN117542122B (en) Human body pose estimation and three-dimensional reconstruction method, network training method and device
CN114612545A (en) Image analysis method and training method, device, equipment and medium of related model
CN110428461A (en) In conjunction with the monocular SLAM method and device of deep learning
CN111275751B (en) Unsupervised absolute scale calculation method and system
Son et al. Partial convolutional LSTM for spatiotemporal prediction of incomplete data
CN115482252A (en) Motion constraint-based SLAM closed loop detection and pose graph optimization method
Li et al. Unsupervised joint learning of depth, optical flow, ego-motion from video
CN114067371B (en) Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN114485417B (en) Structural vibration displacement identification method and system
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
KR20200095251A (en) Apparatus and method for estimating optical flow and disparity via cycle consistency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant