CN114526728A - Monocular vision inertial navigation positioning method based on self-supervision deep learning - Google Patents

Monocular vision inertial navigation positioning method based on self-supervision deep learning Download PDF

Info

Publication number
CN114526728A
Authority
CN
China
Prior art keywords
inertial navigation
data
navigation sensor
original
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210041486.8A
Other languages
Chinese (zh)
Other versions
CN114526728B (en)
Inventor
项志宇
刘磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210041486.8A priority Critical patent/CN114526728B/en
Publication of CN114526728A publication Critical patent/CN114526728A/en
Application granted granted Critical
Publication of CN114526728B publication Critical patent/CN114526728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/10Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C21/12Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C21/16Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C21/165Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G01C21/1656Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments with passive imaging devices, e.g. cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular visual-inertial navigation positioning method based on self-supervised deep learning. Data of each scene are acquired and processed to obtain image preprocessing data, normalized raw inertial navigation sensor data, and the corresponding inertial navigation sensor pre-integration data; the data are aligned by timestamp to form a training set; the training set is input into a network for training to obtain a trained network. Data of a target scene are then acquired, processed, and input into the network, which outputs a predicted 6-degree-of-freedom relative pose; based on the preprocessed inertial navigation sensor data of the target scene, absolute-scale processing is applied to the predicted 6-degree-of-freedom relative pose by the least-squares method to obtain the absolute 6-degree-of-freedom pose of the camera. The invention improves the accuracy of pose estimation and depth estimation, requires no strict spatio-temporal calibration or initialization of the visual and inertial sensors, and is highly robust.

Description

Monocular vision inertial navigation positioning method based on self-supervision deep learning
Technical Field
The invention relates to a navigation positioning method of an intelligent vehicle, in particular to a monocular vision inertial navigation positioning method based on self-supervision deep learning.
Background
High-precision navigation and positioning is a core technology of autonomous vehicles. Conventional high-precision positioning means such as differential GPS combined with inertial navigation devices suffer large errors where GPS signals are poor. Moreover, high-precision inertial navigation equipment is expensive and therefore ill-suited to commercial autonomous-driving solutions. By contrast, matching-based positioning schemes built on semantic maps are relatively cheap, but because of errors in visual semantic perception and the sparsity of semantic elements in the map, such schemes cannot provide positioning in arbitrary scenes. Visual Inertial Odometry (VIO) is a combined positioning method that fuses image data with inertial measurements. Combining a camera with a cheap inertial device effectively suppresses the drift of the inertial device and overcomes problems of the visual odometer such as scale loss and sensitivity to ambient light, making it an effective means of low-cost, high-precision positioning. However, conventional visual-inertial odometry based on physical models relies on robust initialization and strict calibration between the visual sensor and the inertial sensor; when initialization fails, calibration is inaccurate, the scene is blurred, or scene features are sparse, positioning performance degrades markedly. With the successful application of deep learning in computer vision, visual-inertial odometry based on deep learning has attracted growing attention. However, fully supervised visual odometry requires ground-truth poses for training, which limits its application.
Disclosure of Invention
In order to solve the problems in the background art, the invention aims to provide a monocular vision inertial navigation positioning method based on self-supervision deep learning, which is suitable for the fields of unmanned vehicles, robots and the like which need vision positioning.
The invention solves the problems of camera pose estimation and depth estimation of the visual inertial navigation combination by utilizing the parameter learning capability and the nonlinear model fitting capability of the neural network and utilizing the deep learning. Different sub-networks are utilized to respectively model the visual image, the inertial navigation sensor raw data and the pre-integration data thereof, and parameters of a network model are optimized through self-supervision training. In the testing stage, given the image data of the target scene and the corresponding inertial navigation sensor data, the pose with 6 degrees of freedom of absolute scale can be obtained.
The technical scheme adopted by the invention comprises the following steps:
1) respectively acquiring original image data and original inertial navigation sensor data of each scene by using a camera and inertial navigation in the monocular vision inertial navigation odometer, and preprocessing the original image data of the scenes to obtain image preprocessing data; performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, performing normalization processing on the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integration data of the current scene respectively, and obtaining the normalized original inertial navigation sensor data and the normalized inertial navigation sensor pre-integration data respectively;
2) respectively aligning the normalized original inertial navigation sensor data and inertial navigation sensor pre-integral data of the current scene with image preprocessing data according to a timestamp to respectively obtain the aligned original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data, and forming a training set by the image preprocessing data of a plurality of scenes, the aligned original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data;
3) inputting the training set into a deep learning network of the self-supervision visual inertial navigation odometer for training to obtain a well-trained deep learning network of the self-supervision visual inertial navigation odometer;
4) the method comprises the steps of collecting original image data and original inertial navigation sensor data of a target scene, performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, processing the original image data and the original inertial navigation sensor data of the target scene and the corresponding inertial navigation sensor pre-integration data respectively to obtain normalized original inertial navigation sensor data, inertial navigation sensor pre-integration data and image pre-processing data of the target scene, inputting the normalized original inertial navigation sensor data, the normalized inertial navigation sensor pre-integration data and the image pre-processing data into a trained self-supervision visual inertial navigation odometer deep learning network, outputting a predicted relative pose of 6 degrees of freedom, and performing absolute scale processing on the predicted relative pose of 6 degrees of freedom by using a least square method based on the inertial navigation sensor pre-processing data of the target scene to obtain an absolute pose of 6 degrees of freedom of a camera.
In the step 1), cutting the image size of the original image data of the scene to a preset size to obtain image preprocessing data; and then respectively carrying out zero-mean normalization processing on the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data of the scene to respectively obtain the normalized original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data.
The self-supervision visual inertial navigation odometer deep learning network in the step 3) is composed of a target image frame extraction module, a depth estimation network and a pose estimation network, wherein the target image frame extraction module is respectively connected with the depth estimation network and the pose estimation network;
the image preprocessing data in the training set is input into a target image frame extraction module, the target image frame extraction module respectively outputs an original target image frame and a source image frame, the original target image frame is input into a depth estimation network, and the original target image frame, the source image frame, the aligned original inertial navigation sensor data in the training set and the corresponding inertial navigation sensor pre-integration data are input into a pose estimation network.
The pose estimation network comprises a convolutional neural network, a depth self-attention transformation network, a first multilayer perceptron and a second multilayer perceptron;
the method comprises the steps that an original target image frame and a source image frame are input into a convolutional neural network, aligned original inertial navigation sensor data are input into a depth self-attention transformation network, aligned inertial navigation sensor preprocessing data are input into a first multilayer perceptron, the outputs of the convolutional neural network, the depth self-attention transformation network and the first multilayer perceptron are subjected to feature fusion and then input into a second multilayer perceptron, and the output of the second multilayer perceptron is used as the output of a pose estimation network.
The self-supervision visual inertial navigation odometer deep learning network carries out view reconstruction by using a view reconstruction method based on a source image frame output by a target image frame extraction module, a relative pose with 6 degrees of freedom output by a pose estimation network, a depth map output by the depth estimation network and an internal reference matrix of a camera in a monocular visual inertial navigation odometer, obtains a reconstructed target image frame, carries out bilinear interpolation on the reconstructed target image frame by using a bilinear sampling method to obtain a predicted target image frame, finally calculates a total loss function based on the predicted target image frame and an original target image frame output by an original target image frame extraction module, and trains the self-supervision visual inertial navigation odometer deep learning network according to the total loss function.
The image preprocessing data input in the target image frame extraction module is an image sequence, each image sequence comprises at least two images, when the number of the images of each image sequence is 2, the image of the previous frame is used as a source image frame, and the image of the next frame is used as an original target image frame; when the number of the images of each image sequence is odd, the images of the intermediate frames are used as original target image frames, and the images of the rest frames are used as source image frames together.
The depth self-attention transformation network comprises a position coding module, 6 coding modules and a third multilayer perceptron, the aligned original inertial navigation sensor data is input into the position coding module, the output of the position coding module is added with the aligned original inertial navigation sensor data and then input into a first coding module, the first coding module is connected with a sixth coding module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the output of the sixth coding module is input into the third multilayer perceptron, and the output of the third multilayer perceptron is used as the output of the depth self-attention transformation network; the 6 coding modules have the same structure and comprise a multi-head attention mechanism module, a first residual error module, a first normalization layer, a feedforward neural network, a second residual error module and a second normalization layer; in each coding module, the input of the current coding module is respectively input into a multi-head attention mechanism module and a first residual module of the current coding module, the output of the multi-head attention mechanism module and the output of the first residual module are added and then input into a first normalization layer, the output of the first normalization layer is respectively input into a feedforward neural network and a second residual module, the output of the feedforward neural network and the output of the second residual module are added and then input into a second normalization layer, and the output of the second normalization layer is used as the output of the current coding module.
The absolute pose of the 6 degrees of freedom of the camera in the step 4) is obtained by multiplying the absolute scale of the translation amount by the relative pose of the 6 degrees of freedom of the camera, wherein the calculation formula of the absolute scale of the translation amount is as follows:
Scale^{*} = \arg\min_{\mathrm{Scale}} \left\| Y_{i} - \mathrm{Scale} \cdot X_{i} \right\|^{2}
wherein Scale^{*} denotes the final absolute scale of the translation, the 6-degree-of-freedom relative pose comprises the translation X_{i} and the rotation, Y_{i} denotes the aligned inertial navigation sensor pre-integration data, \arg\min\|\cdot\|^{2} denotes the least-squares operation, and Scale denotes the absolute scale of the translation during the calculation.
The invention solves the problems of navigation positioning estimation and depth estimation of a visual inertial navigation combined sensor by utilizing the parameter learning and nonlinear model fitting capability of a neural network and utilizing self-supervision deep learning. Different networks such as a convolutional neural network, a deep self-attention transformation network, a multilayer perceptron and the like are utilized to respectively model the relationship between the visual image, the inertial navigation sensor raw data and the pre-integration data thereof and the navigation parameters, and the parameters of the model are optimized through long-time training. Due to addition of inertial navigation sensor data, under the conditions of only using monocular vision data and adopting self-supervision learning, the pose estimation and the scene depth estimation with 6 degrees of freedom of absolute scale can be finally obtained.
Compared with the background art, the invention has the beneficial effects that:
(1) the robustness of the monocular vision odometer method based on the self-supervision deep learning can be improved, and strict space-time calibration on the vision sensor and the inertial navigation sensor is not needed;
(2) according to the invention, the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integration data are introduced into the learning process of the network model, so that the pose estimation and depth estimation precision of the self-supervision deep learning monocular vision odometer are further improved;
(3) the invention can finally obtain the pose estimation and the scene depth estimation with 6 degrees of freedom of absolute scale under the conditions of only using monocular image data and adopting self-supervision learning.
In summary, compared with a purely monocular visual odometry based on self-supervised deep learning, the proposed method adds the raw inertial navigation sensor data and the inertial navigation sensor pre-integration data, which improves the accuracy of pose estimation and depth estimation as well as the robustness of the model. In particular, compared with conventional combined vision-inertial positioning methods, the method requires no strict spatio-temporal calibration between the visual sensor and the inertial navigation sensor and is highly robust.
Drawings
FIG. 1 is a schematic diagram of the overall network framework of the method of the present invention;
FIG. 2 is a schematic diagram of inertial navigation sensor data pre-integration;
fig. 3 is a schematic diagram of a deep self-attention transformation network structure constructed by the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The embodiment and the implementation process of the complete method according to the invention content are as follows:
the invention comprises the following steps:
1) The monocular visual-inertial odometry system mainly consists of a camera and an inertial navigation unit. The color camera and the inertial navigation unit acquire raw image data and raw inertial navigation sensor data of each scene, respectively: the color camera acquires image data at a relatively low frequency, while the inertial navigation sensor synchronously acquires the acceleration and angular velocity of the host vehicle or robot along three axes at a relatively high frequency. The raw inertial navigation sensor data comprise the 3-axis acceleration and 3-axis angular velocity of the inertial navigation sensor, and the data acquisition frequency of the inertial navigation sensor is 10 times or more the frame rate of the color camera. In a specific implementation, the acquisition frequency of the color camera is 10 Hz and that of the inertial navigation sensor is 100 Hz. The raw image data of a scene are preprocessed to obtain image preprocessing data; the raw inertial navigation sensor (IMU) data are pre-integrated to obtain inertial navigation sensor pre-integration data; and zero-mean normalization is applied to the raw inertial navigation sensor data and the corresponding pre-integration data of the current scene, respectively, yielding the normalized raw inertial navigation sensor data and the normalized pre-integration data;
In step 1), to eliminate the influence of adverse factors, images and inertial data acquired when the scene contains many dynamic objects, is severely occluded, or the autonomous vehicle is stationary are screened out. The images of the raw image data are cropped to a preset size to obtain the image preprocessing data; in one embodiment the preset size is 640 × 192. If an image is smaller than the preset size, its edges are padded so that its size equals the preset size. Because the acquisition frequency of the inertial navigation sensor is far higher than that of the image data, dozens of groups of inertial data correspond to each pair of adjacent image frames, and each group of data corresponds to one pose constraint. In fact, only one constraint is needed between two consecutive images; therefore the inertial sensor data within the interval between two consecutive images must be integrated, as shown in FIG. 2. To avoid heavy repeated computation and wasted computing resources caused by the change of the integration's leading term at every step, a pre-integration operation is performed on the inertial navigation sensor data. In a specific implementation, the 10 inertial measurements corresponding to two adjacent image frames are pre-integrated.
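As a concrete illustration of the pre-integration step, the following is a minimal sketch assuming simple Euler integration that ignores gravity compensation and sensor biases; the function and variable names (preintegrate, so3_exp, etc.) are illustrative and not taken from the patent.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def so3_exp(phi):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3) + skew(phi)
    axis = phi / theta
    K = skew(axis)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate(gyro, accel, dt):
    """gyro, accel: (N, 3) raw IMU samples between frame t and t+1; dt: sample period.
    Returns the pre-integrated relative rotation (3x3) and translation (3,)."""
    R = np.eye(3)      # accumulated relative rotation
    v = np.zeros(3)    # accumulated velocity change
    p = np.zeros(3)    # accumulated position change
    for w, a in zip(gyro, accel):
        a_frame_t = R @ a                        # rotate acceleration into the frame-t body frame
        p = p + v * dt + 0.5 * a_frame_t * dt * dt
        v = v + a_frame_t * dt
        R = R @ so3_exp(w * dt)                  # integrate angular velocity
    return R, p

# Example: 10 IMU samples at 100 Hz between two 10 Hz camera frames.
R_rel, t_rel = preintegrate(np.zeros((10, 3)), np.zeros((10, 3)), dt=0.01)
```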
To eliminate the influence of the physical units of the inertial navigation sensor data, make different data or indicators comparable, and ensure convergence of the network, zero-mean normalization is applied to the raw data and the pre-integration data before they are input into the network, yielding the normalized raw inertial navigation sensor data and the corresponding normalized pre-integration data.
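The zero-mean normalization can be sketched as follows; this is a hedged example, and the channel-wise statistics and the helper name zero_mean_normalize are assumptions rather than the patent's code.

```python
import numpy as np

def zero_mean_normalize(x, eps=1e-8):
    """x: (N, C) array, e.g. N IMU samples with C = 6 channels (3-axis gyro + 3-axis accel).
    Returns the per-channel zero-mean, unit-variance data."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)

imu_raw = np.random.randn(1000, 6)       # placeholder raw IMU data
imu_norm = zero_mean_normalize(imu_raw)  # same normalization is applied to the pre-integration data
```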
2) The high-frequency normalized raw inertial navigation sensor data and the corresponding pre-integration data of the current scene are aligned with the low-frequency image preprocessing data by timestamp, yielding the aligned raw inertial navigation sensor data and the corresponding inertial navigation sensor pre-integration data; the image preprocessing data, aligned raw inertial navigation sensor data, and corresponding pre-integration data of multiple scenes together form the training set;
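A minimal sketch of the timestamp alignment, assuming each pair of consecutive images simply collects the IMU samples whose timestamps fall between the two frames; function and variable names are illustrative.

```python
import numpy as np

def align_imu_to_images(img_ts, imu_ts, imu_data):
    """img_ts: (M,) image timestamps; imu_ts: (N,) IMU timestamps; imu_data: (N, 6).
    Returns a list of (M-1) arrays, one IMU window per consecutive image pair."""
    windows = []
    for t0, t1 in zip(img_ts[:-1], img_ts[1:]):
        mask = (imu_ts >= t0) & (imu_ts < t1)
        windows.append(imu_data[mask])
    return windows

# With a 10 Hz camera and a 100 Hz IMU, each window holds roughly 10 samples.
```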
3) inputting the training set into a deep learning network of the self-supervision visual inertial navigation odometer for training to obtain a well-trained deep learning network of the self-supervision visual inertial navigation odometer;
as shown in fig. 1, the self-supervision visual inertial navigation odometer deep learning network in the step 3) is composed of a target image frame extraction module, a depth estimation network and a pose estimation network, wherein the target image frame extraction module is respectively connected with the depth estimation network and the pose estimation network;
the image preprocessing data in the training set is input into a target image frame extraction module, the target image frame extraction module respectively outputs an original target image frame and a source image frame, the original target image frame is input into a depth estimation network, the depth estimation network outputs a depth map, and the original target image frame, the source image frame, the aligned original inertial navigation sensor data in the training set and the corresponding inertial navigation sensor pre-integration data are input into a pose estimation network.
The pose estimation network comprises a convolutional neural network, a depth self-attention transformation network (Transformer network), a first multilayer perceptron and a second multilayer perceptron;
The original target image frame and the source image frame are input into the convolutional neural network, the aligned raw inertial navigation sensor data into the deep self-attention transformation network, and the aligned inertial navigation sensor pre-integration data into the first multilayer perceptron; the outputs of the three branches are feature-fused and input into the second multilayer perceptron. The pose estimation network thus performs feature learning and parametric neural-network representation on the original target image frame, the source image frame, the time-sequential inertial navigation sensor data, and its pre-integration data through the convolutional neural network, the deep self-attention transformation network, and the first multilayer perceptron, respectively. Specifically, the convolutional neural network performs navigation-parameter feature learning on the original target image frame and the source image frame to obtain an N-dimensional visual feature vector; the deep self-attention transformation network performs navigation-parameter feature learning on the time-sequential raw inertial navigation sensor data to obtain an N-dimensional raw-inertial-data feature vector; and the first multilayer perceptron parameterizes the inertial navigation sensor pre-integration data, which is a relative 6-degree-of-freedom pose between two time instants, so the 6-dimensional pre-integration data can be directly expanded into an N-dimensional representation by a B-layer multilayer perceptron, giving an N-dimensional pre-integration feature vector. The three feature vectors are concatenated into a total feature vector, and the second multilayer perceptron performs a regression that maps the total feature vector to the 6-degree-of-freedom pose parameters, namely a three-dimensional translation and a three-dimensional rotation;
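The fusion structure described above can be sketched in PyTorch as follows. This is an illustrative approximation, not the patented architecture: the feature dimension N (here feat_dim), the small CNN, the use of torch.nn.TransformerEncoder as a stand-in for the position-encoded six-block encoder, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PoseFusionHead(nn.Module):
    def __init__(self, feat_dim=256, imu_len=10):
        super().__init__()
        # CNN branch: target + source frames concatenated on channels (6 x H x W).
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
        # Transformer branch: raw IMU sequence (imu_len x 6 channels).
        enc_layer = nn.TransformerEncoderLayer(d_model=6, nhead=2, batch_first=True)
        self.imu_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.imu_proj = nn.Linear(imu_len * 6, feat_dim)
        # First MLP: 6-D pre-integration -> feat_dim.
        self.preint_mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        # Second MLP: fused features -> 6-DoF relative pose (3 translation + 3 rotation).
        self.pose_mlp = nn.Sequential(nn.Linear(3 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 6))

    def forward(self, frames, imu_raw, imu_preint):
        f_img = self.cnn(frames)                                     # (B, feat_dim)
        f_imu = self.imu_proj(self.imu_encoder(imu_raw).flatten(1))  # (B, feat_dim)
        f_pre = self.preint_mlp(imu_preint)                          # (B, feat_dim)
        fused = torch.cat([f_img, f_imu, f_pre], dim=1)              # feature fusion by concatenation
        return self.pose_mlp(fused)                                  # (B, 6) relative pose

pose = PoseFusionHead()(torch.randn(2, 6, 192, 640), torch.randn(2, 10, 6), torch.randn(2, 6))
```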
source image frame F output by target image frame extraction module based on self-supervision visual inertial navigation odometer deep learning networksObtaining a reconstructed target image frame F 'after a relative pose of 6 degrees of freedom output by a pose estimation network, a depth map output by a depth estimation network and a camera internal reference matrix in a monocular vision inertial navigation odometer are subjected to view reconstruction by using a view reconstruction method'tThe concrete formula is as follows:
Figure BDA0003470437240000061
wherein, F'tRepresenting a reconstructed target image frame, K is an internal reference matrix of the camera,
Figure BDA0003470437240000062
the relative pose in 6 degrees of freedom of the network output is estimated for pose,
Figure BDA0003470437240000063
depth map p output for computing depth estimation networktDepth of each pixel point in, K-1Representing the inverse of the camera's intrinsic reference matrix, and using the bilinear sampling method since the pixel values calculated by this formula are continuousCarrying out bilinear interpolation on the reconstructed target image frame by the method, obtaining a reconstructed pixel value from surrounding pixels according to a proportion, and obtaining a reconstructed target image frame F'tTherefore, the error between the reconstructed target image frame and the original target image frame is optimized to restrict the training process of the whole network model. And finally, based on the reconstructed target image frame F'tAnd the original target image frame FtAnd calculating a total loss function by the original target image frame output by the extraction module, and training the self-supervision visual inertial navigation odometer deep learning network according to the total loss function. The total loss function includes a view reconstruction loss function and an edge perception smoothness loss function. The learning process of the self-supervision visual inertial navigation odometer deep learning network does not need any truth value participation, the total loss function is utilized to simultaneously constrain the deep estimation network and the pose estimation network, and the training of the whole network is completed. And coupling the two networks to complete the self-supervision training process by using the total loss function.
The image preprocessing data input to the target image frame extraction module is an image sequence, each sequence containing at least two images. When a sequence contains 2 images, the earlier frame is taken as the source image frame F_s and the later frame as the original target image frame F_t; when a sequence contains an odd number of images, the middle frame is taken as the original target image frame F_t and the remaining frames together form the source image frames F_s.
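The frame-splitting rule above can be illustrated with a short sketch; the helper name split_sequence is an assumption.

```python
def split_sequence(frames):
    """frames: list of images ordered in time. Returns (target_frame, source_frames)."""
    if len(frames) == 2:
        return frames[1], [frames[0]]            # later frame is the target, earlier frame is the source
    mid = len(frames) // 2                       # odd-length sequence: middle frame is the target
    return frames[mid], frames[:mid] + frames[mid + 1:]
```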
As shown in fig. 3, the depth self-attention transformation network includes a position coding module, 6 coding modules and a third multi-layer perceptron, the aligned original inertial navigation sensor data is input into the position coding module, the output of the position coding module is added with the aligned original inertial navigation sensor data and then input into a first coding module, the first coding module is connected with a sixth coding module after passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module in sequence, the output of the sixth coding module is input into the third multi-layer perceptron, and the output of the third multi-layer perceptron is used as the output of the depth self-attention transformation network; the 6 coding modules have the same structure and comprise a multi-head attention mechanism module, a first residual error module, a first normalization layer, a feedforward neural network, a second residual error module and a second normalization layer; in each coding module, the input of the current coding module is respectively input into a multi-head attention mechanism module and a first residual error module of the current coding module, the output of the multi-head attention mechanism module and the output of the first residual error module are added and then input into a first normalization layer, the output of the first normalization layer is respectively input into a feedforward neural network and a second residual error module, the output of the feedforward neural network and the output of the second residual error module are added and then input into a second normalization layer, and the output of the second normalization layer is used as the output of the current coding module.
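One encoder module of the deep self-attention transformation network, as described above, can be sketched as follows (post-norm form: attention plus residual plus LayerNorm, then feed-forward plus residual plus LayerNorm). The layer sizes (d_model, number of heads, feed-forward width) are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=6, n_heads=2, d_ff=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (B, seq_len, d_model)
        a, _ = self.attn(x, x, x)            # multi-head self-attention
        x = self.norm1(x + a)                # first residual path + first normalization layer
        x = self.norm2(x + self.ffn(x))      # feed-forward, second residual path + second normalization layer
        return x

# Six stacked blocks follow the positional encoding; a final MLP (not shown) produces the branch output.
encoder = torch.nn.Sequential(*[EncoderBlock() for _ in range(6)])
out = encoder(torch.randn(2, 10, 6))         # e.g. 10 IMU samples with 6 channels each
```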
In a specific implementation, to enhance the expressive ability of the network and prevent degradation as the network deepens, the encoder of the depth estimation network adopts a standard ResNet18. The depth estimation network takes only the original target image frame as input, while the original target image frame and the source image frame, concatenated along the channel dimension, serve as the input of the convolutional-neural-network part of the pose estimation network.
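A hedged sketch of this input arrangement, assuming torchvision's standard ResNet18 as the depth encoder and channel-wise concatenation for the pose branch; the decoder that upsamples the encoder features into a depth map is omitted.

```python
import torch
import torchvision

encoder = torchvision.models.resnet18()                           # standard ResNet18, randomly initialized
backbone = torch.nn.Sequential(*list(encoder.children())[:-2])    # drop avgpool/fc, keep the feature maps

target = torch.randn(1, 3, 192, 640)
source = torch.randn(1, 3, 192, 640)

feat = backbone(target)                        # (1, 512, 6, 20); a decoder (not shown) turns this into a depth map
pose_cnn_input = torch.cat([target, source], dim=1)  # (1, 6, 192, 640): channel-wise input for the pose CNN
```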
Since the raw inertial navigation sensor data has a dimension of 6, the input node count of the deep self-attention transformation network is set to 6.
In order to make the output of the attention layer contain richer coded representation information and enhance the expressive power of the model, the number of the attention heads of the multi-head attention mechanism module is set as n.
Through self-supervision training, the depth network and the pose estimation network can respectively output a depth map of a scene and a relative pose between two frames of images.
4) The method comprises the steps of collecting original image data and original inertial navigation sensor data of a target scene, performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, processing the original image data and the original inertial navigation sensor data of the target scene and the corresponding inertial navigation sensor pre-integration data respectively to obtain normalized original inertial navigation sensor data, inertial navigation sensor pre-integration data and image pre-processing data of the target scene, inputting the normalized original inertial navigation sensor data, the normalized inertial navigation sensor pre-integration data and the image pre-processing data into a trained self-supervision visual inertial navigation odometer deep learning network, outputting a predicted relative pose of 6 degrees of freedom, and performing absolute scale processing on the predicted relative pose of 6 degrees of freedom by using a least square method based on the inertial navigation sensor pre-processing data of the target scene to obtain an absolute pose of 6 degrees of freedom of a camera.
The absolute 6-degree-of-freedom pose of the camera in step 4) is obtained by multiplying the 6-degree-of-freedom relative pose of the camera by the absolute scale of the translation, where the least-squares method optimizes the error between the predicted translation X_{i} and the aligned inertial navigation sensor pre-integration data Y_{i} to obtain the absolute scale of the translation. The absolute scale of the translation is computed as:
Scale^{*} = \arg\min_{\mathrm{Scale}} \left\| Y_{i} - \mathrm{Scale} \cdot X_{i} \right\|^{2}
wherein Scale^{*} denotes the final absolute scale of the translation, the 6-degree-of-freedom relative pose comprises the translation X_{i} and the rotation, Y_{i} denotes the aligned inertial navigation sensor pre-integration data, \arg\min\|\cdot\|^{2} denotes the least-squares operation, and Scale denotes the absolute scale of the translation during the calculation.
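The least-squares scale recovery in the formula above has a simple closed form: minimizing ||Y_i − Scale·X_i||² over Scale gives Scale = (Σ Y_i·X_i)/(Σ X_i·X_i). The sketch below is illustrative; the function name and the use of stacked translation vectors are assumptions.

```python
import numpy as np

def recover_scale(pred_translations, preint_translations):
    """pred_translations, preint_translations: (N, 3) arrays of aligned translation vectors
    (network prediction X_i and IMU pre-integration Y_i). Returns the least-squares scale."""
    X = np.asarray(pred_translations).ravel()
    Y = np.asarray(preint_translations).ravel()
    return float(np.dot(Y, X) / np.dot(X, X))

scale = recover_scale(np.random.rand(50, 3), np.random.rand(50, 3))
# The absolute 6-DoF pose is then obtained by applying this scale to the predicted translation.
```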
Testing was carried out on the KITTI dataset; the pose estimation errors are detailed in Table 1 below. Method 1 (SC-SfMLearner), Method 2 (Zhou et al.) and Method 3 (Zhan et al.) are currently high-performing learning-based purely visual odometry methods.
TABLE 1 pose estimation error tested on KITTI data set
(The numerical data of Table 1 are provided as an image in the original publication.)
It can be seen that with the proposed method, both the translation error and the rotation error of the model are clearly improved on the KITTI 09 sequence; on the KITTI 10 sequence, the rotation error is significantly improved. Note that the performance of Method 1 is obtained with the absolute scale recovered from ground truth, whereas the proposed method uses no ground truth at all and relies only on IMU data to recover the absolute scale of the pose.
Table 2 below gives the depth estimation errors tested on the KITTI data. Method 4 (Monodepth2), Method 1 (SC-SfMLearner), Method 5 (D3VO) and Method 6 (PackNet-SfM) are all currently high-performing learning-based purely visual odometry methods in terms of depth estimation. In the table, ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
TABLE 2 depth estimation error tested on KITTI data
(The numerical data of Table 2 are provided as an image in the original publication.)
Compared with these currently high-performing learning-based purely visual odometry methods, the proposed method effectively improves every depth estimation metric.

Claims (8)

1. A monocular vision inertial navigation positioning method based on self-supervision deep learning is characterized by comprising the following steps:
1) respectively acquiring original image data and original inertial navigation sensor data of each scene by using a camera and inertial navigation in the monocular vision inertial navigation odometer, and preprocessing the original image data of the scenes to obtain image preprocessing data; performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, performing normalization processing on the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integration data of the current scene respectively, and obtaining the normalized original inertial navigation sensor data and the normalized inertial navigation sensor pre-integration data respectively;
2) respectively aligning the normalized original inertial navigation sensor data and inertial navigation sensor pre-integral data of the current scene with image preprocessing data according to a timestamp to respectively obtain the aligned original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data, and forming a training set by the image preprocessing data of a plurality of scenes, the aligned original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data;
3) inputting the training set into a deep learning network of the self-supervision visual inertial navigation odometer for training to obtain a well-trained deep learning network of the self-supervision visual inertial navigation odometer;
4) the method comprises the steps of collecting original image data and original inertial navigation sensor data of a target scene, performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, processing the original image data and the original inertial navigation sensor data of the target scene and the corresponding inertial navigation sensor pre-integration data respectively to obtain normalized original inertial navigation sensor data, inertial navigation sensor pre-integration data and image pre-processing data of the target scene, inputting the normalized original inertial navigation sensor data, the normalized inertial navigation sensor pre-integration data and the image pre-processing data into a trained self-supervision visual inertial navigation odometer deep learning network, outputting a predicted relative pose of 6 degrees of freedom, and performing absolute scale processing on the predicted relative pose of 6 degrees of freedom by using a least square method based on the inertial navigation sensor pre-processing data of the target scene to obtain an absolute pose of 6 degrees of freedom of a camera.
2. The method for positioning monocular vision inertial navigation based on self-supervision deep learning according to claim 1, wherein in step 1), the image size of the original image data of the scene is clipped to a preset size, and image preprocessing data is obtained; and then respectively carrying out zero-mean normalization processing on the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data of the scene to respectively obtain the normalized original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data.
3. The monocular vision inertial navigation positioning method based on the self-supervision deep learning of claim 1, wherein the self-supervision vision inertial navigation odometer deep learning network in the step 3) is composed of a target image frame extraction module, a depth estimation network and a pose estimation network, and the target image frame extraction module is respectively connected with the depth estimation network and the pose estimation network;
the image preprocessing data in the training set is input into a target image frame extraction module, the target image frame extraction module respectively outputs an original target image frame and a source image frame, the original target image frame is input into a depth estimation network, and the original target image frame, the source image frame, the aligned original inertial navigation sensor data in the training set and the corresponding inertial navigation sensor pre-integration data are input into a pose estimation network.
4. The method for monocular vision inertial navigation positioning based on self-supervision deep learning according to claim 3, wherein the pose estimation network comprises a convolutional neural network, a deep self-attention transformation network, a first multi-layer perceptron and a second multi-layer perceptron;
the method comprises the steps that an original target image frame and a source image frame are input into a convolutional neural network, aligned original inertial navigation sensor data are input into a depth self-attention transformation network, aligned inertial navigation sensor preprocessing data are input into a first multilayer perceptron, the outputs of the convolutional neural network, the depth self-attention transformation network and the first multilayer perceptron are subjected to feature fusion and then input into a second multilayer perceptron, and the output of the second multilayer perceptron is used as the output of a pose estimation network.
5. The monocular vision inertial navigation positioning method based on self-supervision deep learning according to claim 3, wherein the self-supervision visual inertial navigation odometer deep learning network carries out view reconstruction by using a view reconstruction method based on the source image frame output by the target image frame extraction module, the relative pose with 6 degrees of freedom output by the pose estimation network, the depth map output by the depth estimation network and the internal parameter matrix of the camera in the monocular visual inertial navigation odometer, obtaining a reconstructed target image frame; bilinear interpolation is carried out on the reconstructed target image frame by using a bilinear sampling method to obtain a predicted target image frame; and finally, a total loss function is calculated based on the predicted target image frame and the original target image frame output by the target image frame extraction module, and the self-supervision visual inertial navigation odometer deep learning network is trained according to the total loss function.
6. The monocular vision inertial navigation positioning method based on the self-supervision deep learning according to claim 3, characterized in that the image preprocessing data inputted in the target image frame extraction module is an image sequence, each image sequence comprises at least two images, when the number of the images of each image sequence is 2, the image of the previous frame is used as a source image frame, and the image of the next frame is used as an original target image frame; when the number of the images of each image sequence is odd, the images of the intermediate frames are used as original target image frames, and the images of the rest frames are used as source image frames together.
7. The method for positioning monocular vision inertial navigation based on self-supervision deep learning according to claim 4, wherein the depth self-attention transformation network comprises a position coding module, 6 coding modules and a third multilayer perceptron, the aligned original inertial navigation sensor data is input into the position coding module, the output of the position coding module is added with the aligned original inertial navigation sensor data and then input into the first coding module, the first coding module is connected with a sixth coding module after passing through the second coding module, the third coding module, the fourth coding module and the fifth coding module in sequence, the output of the sixth coding module is input into the third multilayer perceptron, and the output of the third multilayer perceptron is used as the output of the depth self-attention transformation network; the 6 coding modules have the same structure and comprise a multi-head attention mechanism module, a first residual error module, a first normalization layer, a feedforward neural network, a second residual error module and a second normalization layer; in each coding module, the input of the current coding module is respectively input into a multi-head attention mechanism module and a first residual module of the current coding module, the output of the multi-head attention mechanism module and the output of the first residual module are added and then input into a first normalization layer, the output of the first normalization layer is respectively input into a feedforward neural network and a second residual module, the output of the feedforward neural network and the output of the second residual module are added and then input into a second normalization layer, and the output of the second normalization layer is used as the output of the current coding module.
8. The method for monocular vision inertial navigation positioning based on self-supervision deep learning according to claim 1, wherein the absolute pose of 6 degrees of freedom of the camera in step 4) is obtained by multiplying the absolute scale of the translation amount by the relative pose of 6 degrees of freedom of the camera, wherein the calculation formula of the absolute scale of the translation amount is as follows:
Scale*=arg min||Yi-Scale*Xi||2
wherein, Scale*Representing the absolute scale of the final translation, the relative pose of 6 degrees of freedom includes the translation XiAnd amount of rotation, YiRepresenting the pre-integration data of the aligned inertial navigation sensor, argmin | | | | | non-woven cells2Represents the least squares operation and Scale represents the absolute Scale of the amount of translation during the calculation.
CN202210041486.8A 2022-01-14 2022-01-14 Monocular vision inertial navigation positioning method based on self-supervision deep learning Active CN114526728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210041486.8A CN114526728B (en) 2022-01-14 2022-01-14 Monocular vision inertial navigation positioning method based on self-supervision deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210041486.8A CN114526728B (en) 2022-01-14 2022-01-14 Monocular vision inertial navigation positioning method based on self-supervision deep learning

Publications (2)

Publication Number Publication Date
CN114526728A true CN114526728A (en) 2022-05-24
CN114526728B CN114526728B (en) 2023-12-05

Family

ID=81620126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210041486.8A Active CN114526728B (en) 2022-01-14 2022-01-14 Monocular vision inertial navigation positioning method based on self-supervision deep learning

Country Status (1)

Country Link
CN (1) CN114526728B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103292804A (en) * 2013-05-27 2013-09-11 浙江大学 Monocular natural vision landmark assisted mobile robot positioning method
CN108986166A (en) * 2018-07-20 2018-12-11 山东大学 A kind of monocular vision mileage prediction technique and odometer based on semi-supervised learning
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN110595466A (en) * 2019-09-18 2019-12-20 电子科技大学 Lightweight inertial-assisted visual odometer implementation method based on deep learning
CN110717927A (en) * 2019-10-10 2020-01-21 桂林电子科技大学 Indoor robot motion estimation method based on deep learning and visual inertial fusion
US20210118184A1 (en) * 2019-10-17 2021-04-22 Toyota Research Institute, Inc. Systems and methods for self-supervised scale-aware training of a model for monocular depth estimation
CN111325797A (en) * 2020-03-03 2020-06-23 华东理工大学 Pose estimation method based on self-supervision learning
CN111311685A (en) * 2020-05-12 2020-06-19 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU/monocular image
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
熊炜; 金靖熠; 王娟; 刘敏; 曾春艳: "Monocular Visual Odometry Based on the Deep Learning Feature Point Method", Computer Engineering & Science, no. 01 *
苏健鹏; 黄影平; 赵柏淦; 胡兴: "Research on Visual Odometry Based on Deep Convolutional Neural Networks", Optical Instruments, no. 04 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114993306A (en) * 2022-08-04 2022-09-02 中国人民解放军国防科技大学 Scale self-recovery visual inertial integrated navigation method and device
CN114993306B (en) * 2022-08-04 2022-10-28 中国人民解放军国防科技大学 Scale self-recovery visual inertial integrated navigation method and device
CN116681759A (en) * 2023-04-19 2023-09-01 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences Camera pose estimation method based on self-supervision visual inertial odometer
CN116681759B (en) * 2023-04-19 2024-02-23 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences Camera pose estimation method based on self-supervision visual inertial odometer

Also Published As

Publication number Publication date
CN114526728B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
CN107516326B (en) Robot positioning method and system fusing monocular vision and encoder information
CN107481292A (en) The attitude error method of estimation and device of vehicle-mounted camera
CN114526728B (en) Monocular vision inertial navigation positioning method based on self-supervision deep learning
CN106780631B (en) Robot closed-loop detection method based on deep learning
CN109242003B (en) Vehicle-mounted vision system self-motion determination method based on deep convolutional neural network
CN110533724B (en) Computing method of monocular vision odometer based on deep learning and attention mechanism
CN112556719B (en) Visual inertial odometer implementation method based on CNN-EKF
CN108648274A (en) A kind of cognition point cloud map creation system of vision SLAM
CN113392904B (en) LTC-DNN-based visual inertial navigation combined navigation system and self-learning method
CN114332394A (en) Semantic information assistance-based dynamic scene three-dimensional reconstruction method
CN114964276A (en) Dynamic vision SLAM method fusing inertial navigation
CN113313176A (en) Point cloud analysis method based on dynamic graph convolution neural network
CN117058474B (en) Depth estimation method and system based on multi-sensor fusion
Gao et al. Gyro-net: IMU gyroscopes random errors compensation method based on deep learning
CN112348854A (en) Visual inertial mileage detection method based on deep learning
CN116824433A (en) Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network
CN116704032A (en) Outdoor visual SLAM method based on monocular depth estimation network and GPS
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN113379787B (en) Target tracking method based on 3D convolution twin neural network and template updating
CN113848884B (en) Unmanned engineering machinery decision method based on feature fusion and space-time constraint
CN115100237A (en) Visual odometer method for inspection robot
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN114743105A (en) Depth privilege visual odometer method based on cross-modal knowledge distillation
CN113160391A (en) Double-stage three-dimensional scene modeling method
CN114034312B (en) Light-weight multi-decoupling visual odometer implementation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant