CN114526728A - Monocular vision inertial navigation positioning method based on self-supervision deep learning - Google Patents

Monocular vision inertial navigation positioning method based on self-supervision deep learning Download PDF

Info

Publication number
CN114526728A
Authority
CN
China
Prior art keywords
inertial navigation
data
navigation sensor
original
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210041486.8A
Other languages
Chinese (zh)
Other versions
CN114526728B (en)
Inventor
项志宇
刘磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210041486.8A priority Critical patent/CN114526728B/en
Publication of CN114526728A publication Critical patent/CN114526728A/en
Application granted granted Critical
Publication of CN114526728B publication Critical patent/CN114526728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/10Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C21/12Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C21/16Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C21/165Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G01C21/1656Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments with passive imaging devices, e.g. cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular visual-inertial navigation positioning method based on self-supervised deep learning. Data of each scene are acquired and processed to obtain image preprocessing data, normalized raw inertial navigation sensor data, and the corresponding inertial navigation sensor pre-integration data; the data are aligned by timestamp to form a training set; the training set is input into a network for training to obtain a trained network. Data of a target scene are then acquired, processed, and input into the network, which outputs a predicted 6-degree-of-freedom relative pose; based on the preprocessed inertial navigation sensor data of the target scene, absolute-scale processing is applied to the predicted 6-degree-of-freedom relative pose by the least-squares method to obtain the absolute 6-degree-of-freedom pose of the camera. The invention improves the accuracy of pose estimation and depth estimation, requires no strict spatio-temporal calibration or initialization of the visual and inertial sensors, and is highly robust.

Description

Monocular vision inertial navigation positioning method based on self-supervision deep learning
Technical Field
The invention relates to a navigation positioning method of an intelligent vehicle, in particular to a monocular vision inertial navigation positioning method based on self-supervision deep learning.
Background
High-precision navigation and positioning is a core technology of autonomous vehicles. Conventional high-precision positioning means such as differential GPS combined with inertial navigation devices suffer large errors where GPS signals are poor. Moreover, high-precision inertial navigation equipment is expensive and therefore ill-suited to commercial autonomous-driving solutions. By contrast, matching-based positioning schemes built on semantic maps are relatively cheap, but because of errors in visual semantic perception and the sparsity of semantic elements in the map, such schemes cannot provide positioning in arbitrary scenes. Visual Inertial Odometry (VIO) is a combined positioning method that fuses image data with inertial measurements. Combining a camera with a cheap inertial device effectively suppresses the drift of the inertial device and overcomes problems of the visual odometer such as scale loss and sensitivity to ambient light, making it an effective means of low-cost, high-precision positioning. However, conventional visual-inertial odometry based on physical models relies on robust initialization and strict calibration between the visual sensor and the inertial sensor; when initialization fails, calibration is inaccurate, the scene is blurred, or scene features are sparse, positioning performance degrades markedly. With the successful application of deep learning in computer vision, visual-inertial odometry based on deep learning has attracted growing attention. However, fully supervised visual odometry requires ground-truth poses for training, which limits its application.
Disclosure of Invention
In order to solve the problems in the background art, the invention aims to provide a monocular vision inertial navigation positioning method based on self-supervision deep learning, which is suitable for the fields of unmanned vehicles, robots and the like which need vision positioning.
The invention solves the problems of camera pose estimation and depth estimation of the visual inertial navigation combination by utilizing the parameter learning capability and the nonlinear model fitting capability of the neural network and utilizing the deep learning. Different sub-networks are utilized to respectively model the visual image, the inertial navigation sensor raw data and the pre-integration data thereof, and parameters of a network model are optimized through self-supervision training. In the testing stage, given the image data of the target scene and the corresponding inertial navigation sensor data, the pose with 6 degrees of freedom of absolute scale can be obtained.
The technical scheme adopted by the invention comprises the following steps:
1) respectively acquiring original image data and original inertial navigation sensor data of each scene by using a camera and inertial navigation in the monocular vision inertial navigation odometer, and preprocessing the original image data of the scenes to obtain image preprocessing data; performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, performing normalization processing on the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integration data of the current scene respectively, and obtaining the normalized original inertial navigation sensor data and the normalized inertial navigation sensor pre-integration data respectively;
2) respectively aligning the normalized original inertial navigation sensor data and inertial navigation sensor pre-integral data of the current scene with image preprocessing data according to a timestamp to respectively obtain the aligned original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data, and forming a training set by the image preprocessing data of a plurality of scenes, the aligned original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data;
3) inputting the training set into a deep learning network of the self-supervision visual inertial navigation odometer for training to obtain a well-trained deep learning network of the self-supervision visual inertial navigation odometer;
4) the method comprises the steps of collecting original image data and original inertial navigation sensor data of a target scene, performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, processing the original image data and the original inertial navigation sensor data of the target scene and the corresponding inertial navigation sensor pre-integration data respectively to obtain normalized original inertial navigation sensor data, inertial navigation sensor pre-integration data and image pre-processing data of the target scene, inputting the normalized original inertial navigation sensor data, the normalized inertial navigation sensor pre-integration data and the image pre-processing data into a trained self-supervision visual inertial navigation odometer deep learning network, outputting a predicted relative pose of 6 degrees of freedom, and performing absolute scale processing on the predicted relative pose of 6 degrees of freedom by using a least square method based on the inertial navigation sensor pre-processing data of the target scene to obtain an absolute pose of 6 degrees of freedom of a camera.
In the step 1), cutting the image size of the original image data of the scene to a preset size to obtain image preprocessing data; and then respectively carrying out zero-mean normalization processing on the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data of the scene to respectively obtain the normalized original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data.
The self-supervision visual inertial navigation odometer deep learning network in the step 3) is composed of a target image frame extraction module, a depth estimation network and a pose estimation network, wherein the target image frame extraction module is respectively connected with the depth estimation network and the pose estimation network;
the image preprocessing data in the training set is input into a target image frame extraction module, the target image frame extraction module respectively outputs an original target image frame and a source image frame, the original target image frame is input into a depth estimation network, and the original target image frame, the source image frame, the aligned original inertial navigation sensor data in the training set and the corresponding inertial navigation sensor pre-integration data are input into a pose estimation network.
The pose estimation network comprises a convolutional neural network, a depth self-attention transformation network, a first multilayer perceptron and a second multilayer perceptron;
the method comprises the steps that an original target image frame and a source image frame are input into a convolutional neural network, aligned original inertial navigation sensor data are input into a depth self-attention transformation network, aligned inertial navigation sensor preprocessing data are input into a first multilayer perceptron, the outputs of the convolutional neural network, the depth self-attention transformation network and the first multilayer perceptron are subjected to feature fusion and then input into a second multilayer perceptron, and the output of the second multilayer perceptron is used as the output of a pose estimation network.
The self-supervision visual inertial navigation odometer deep learning network carries out view reconstruction by using a view reconstruction method based on a source image frame output by a target image frame extraction module, a relative pose with 6 degrees of freedom output by a pose estimation network, a depth map output by the depth estimation network and an internal reference matrix of a camera in a monocular visual inertial navigation odometer, obtains a reconstructed target image frame, carries out bilinear interpolation on the reconstructed target image frame by using a bilinear sampling method to obtain a predicted target image frame, finally calculates a total loss function based on the predicted target image frame and an original target image frame output by an original target image frame extraction module, and trains the self-supervision visual inertial navigation odometer deep learning network according to the total loss function.
The image preprocessing data input in the target image frame extraction module is an image sequence, each image sequence comprises at least two images, when the number of the images of each image sequence is 2, the image of the previous frame is used as a source image frame, and the image of the next frame is used as an original target image frame; when the number of the images of each image sequence is odd, the images of the intermediate frames are used as original target image frames, and the images of the rest frames are used as source image frames together.
The depth self-attention transformation network comprises a position coding module, 6 coding modules and a third multilayer perceptron, the aligned original inertial navigation sensor data is input into the position coding module, the output of the position coding module is added with the aligned original inertial navigation sensor data and then input into a first coding module, the first coding module is connected with a sixth coding module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the output of the sixth coding module is input into the third multilayer perceptron, and the output of the third multilayer perceptron is used as the output of the depth self-attention transformation network; the 6 coding modules have the same structure and comprise a multi-head attention mechanism module, a first residual error module, a first normalization layer, a feedforward neural network, a second residual error module and a second normalization layer; in each coding module, the input of the current coding module is respectively input into a multi-head attention mechanism module and a first residual module of the current coding module, the output of the multi-head attention mechanism module and the output of the first residual module are added and then input into a first normalization layer, the output of the first normalization layer is respectively input into a feedforward neural network and a second residual module, the output of the feedforward neural network and the output of the second residual module are added and then input into a second normalization layer, and the output of the second normalization layer is used as the output of the current coding module.
The absolute pose of the 6 degrees of freedom of the camera in the step 4) is obtained by multiplying the absolute scale of the translation amount by the relative pose of the 6 degrees of freedom of the camera, wherein the calculation formula of the absolute scale of the translation amount is as follows:
Scale^{*} = \arg\min_{\mathrm{Scale}} \left\| Y_{i} - \mathrm{Scale} \cdot X_{i} \right\|^{2}
wherein Scale^{*} denotes the final absolute scale of the translation, the 6-degree-of-freedom relative pose comprises the translation X_{i} and the rotation, Y_{i} denotes the aligned inertial navigation sensor pre-integration data, \arg\min\|\cdot\|^{2} denotes the least-squares operation, and Scale denotes the absolute scale of the translation during the calculation.
The invention solves the problems of navigation positioning estimation and depth estimation of a visual inertial navigation combined sensor by utilizing the parameter learning and nonlinear model fitting capability of a neural network and utilizing self-supervision deep learning. Different networks such as a convolutional neural network, a deep self-attention transformation network, a multilayer perceptron and the like are utilized to respectively model the relationship between the visual image, the inertial navigation sensor raw data and the pre-integration data thereof and the navigation parameters, and the parameters of the model are optimized through long-time training. Due to addition of inertial navigation sensor data, under the conditions of only using monocular vision data and adopting self-supervision learning, the pose estimation and the scene depth estimation with 6 degrees of freedom of absolute scale can be finally obtained.
Compared with the background art, the invention has the beneficial effects that:
(1) the robustness of the monocular vision odometer method based on the self-supervision deep learning can be improved, and strict space-time calibration on the vision sensor and the inertial navigation sensor is not needed;
(2) according to the invention, the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integration data are introduced into the learning process of the network model, so that the pose estimation and depth estimation precision of the self-supervision deep learning monocular vision odometer are further improved;
(3) the invention can finally obtain the pose estimation and the scene depth estimation with 6 degrees of freedom of absolute scale under the conditions of only using monocular image data and adopting self-supervision learning.
In summary, compared with a purely monocular visual odometry based on self-supervised deep learning, the proposed method adds the raw inertial navigation sensor data and the inertial navigation sensor pre-integration data, which improves the accuracy of pose estimation and depth estimation as well as the robustness of the model. In particular, compared with conventional combined vision-inertial positioning methods, the method requires no strict spatio-temporal calibration between the visual sensor and the inertial navigation sensor and is highly robust.
Drawings
FIG. 1 is a schematic diagram of the overall network framework of the method of the present invention;
FIG. 2 is a schematic diagram of inertial navigation sensor data pre-integration;
fig. 3 is a schematic diagram of a deep self-attention transformation network structure constructed by the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The embodiment and the implementation process of the complete method according to the invention content are as follows:
the invention comprises the following steps:
1) The monocular visual-inertial odometry system mainly consists of a camera and an inertial navigation unit. The color camera and the inertial navigation unit acquire raw image data and raw inertial navigation sensor data of each scene, respectively: the color camera acquires image data at a relatively low frequency, while the inertial navigation sensor synchronously acquires the acceleration and angular velocity of the host vehicle or robot along three axes at a relatively high frequency. The raw inertial navigation sensor data comprise the 3-axis acceleration and 3-axis angular velocity of the inertial navigation sensor, and the data acquisition frequency of the inertial navigation sensor is 10 times or more the frame rate of the color camera. In a specific implementation, the acquisition frequency of the color camera is 10 Hz and that of the inertial navigation sensor is 100 Hz. The raw image data of a scene are preprocessed to obtain image preprocessing data; the raw inertial navigation sensor (IMU) data are pre-integrated to obtain inertial navigation sensor pre-integration data; and zero-mean normalization is applied to the raw inertial navigation sensor data and the corresponding pre-integration data of the current scene, respectively, yielding the normalized raw inertial navigation sensor data and the normalized pre-integration data;
In step 1), to eliminate the influence of adverse factors, images and inertial data acquired when the scene contains many dynamic objects, is severely occluded, or the autonomous vehicle is stationary are screened out. The images of the raw image data are cropped to a preset size to obtain the image preprocessing data; in one embodiment the preset size is 640 × 192. If an image is smaller than the preset size, its edges are padded so that its size equals the preset size. Because the acquisition frequency of the inertial navigation sensor is far higher than that of the image data, dozens of groups of inertial data correspond to each pair of adjacent image frames, and each group of data corresponds to one pose constraint. In fact, only one constraint is needed between two consecutive images; therefore the inertial sensor data within the interval between two consecutive images must be integrated, as shown in FIG. 2. To avoid heavy repeated computation and wasted computing resources caused by the change of the integration's leading term at every step, a pre-integration operation is performed on the inertial navigation sensor data. In a specific implementation, the 10 inertial measurements corresponding to two adjacent image frames are pre-integrated.
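As a concrete illustration of the pre-integration step, the following is a minimal sketch assuming simple Euler integration that ignores gravity compensation and sensor biases; the function and variable names (preintegrate, so3_exp, etc.) are illustrative and not taken from the patent.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def so3_exp(phi):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3) + skew(phi)
    axis = phi / theta
    K = skew(axis)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate(gyro, accel, dt):
    """gyro, accel: (N, 3) raw IMU samples between frame t and t+1; dt: sample period.
    Returns the pre-integrated relative rotation (3x3) and translation (3,)."""
    R = np.eye(3)      # accumulated relative rotation
    v = np.zeros(3)    # accumulated velocity change
    p = np.zeros(3)    # accumulated position change
    for w, a in zip(gyro, accel):
        a_frame_t = R @ a                        # rotate acceleration into the frame-t body frame
        p = p + v * dt + 0.5 * a_frame_t * dt * dt
        v = v + a_frame_t * dt
        R = R @ so3_exp(w * dt)                  # integrate angular velocity
    return R, p

# Example: 10 IMU samples at 100 Hz between two 10 Hz camera frames.
R_rel, t_rel = preintegrate(np.zeros((10, 3)), np.zeros((10, 3)), dt=0.01)
```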
To eliminate the influence of the physical units of the inertial navigation sensor data, make different data or indicators comparable, and ensure convergence of the network, zero-mean normalization is applied to the raw data and the pre-integration data before they are input into the network, yielding the normalized raw inertial navigation sensor data and the corresponding normalized pre-integration data.
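The zero-mean normalization can be sketched as follows; this is a hedged example, and the channel-wise statistics and the helper name zero_mean_normalize are assumptions rather than the patent's code.

```python
import numpy as np

def zero_mean_normalize(x, eps=1e-8):
    """x: (N, C) array, e.g. N IMU samples with C = 6 channels (3-axis gyro + 3-axis accel).
    Returns the per-channel zero-mean, unit-variance data."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)

imu_raw = np.random.randn(1000, 6)       # placeholder raw IMU data
imu_norm = zero_mean_normalize(imu_raw)  # same normalization is applied to the pre-integration data
```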
2) The high-frequency normalized raw inertial navigation sensor data and the corresponding pre-integration data of the current scene are aligned with the low-frequency image preprocessing data by timestamp, yielding the aligned raw inertial navigation sensor data and the corresponding inertial navigation sensor pre-integration data; the image preprocessing data, aligned raw inertial navigation sensor data, and corresponding pre-integration data of multiple scenes together form the training set;
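A minimal sketch of the timestamp alignment, assuming each pair of consecutive images simply collects the IMU samples whose timestamps fall between the two frames; function and variable names are illustrative.

```python
import numpy as np

def align_imu_to_images(img_ts, imu_ts, imu_data):
    """img_ts: (M,) image timestamps; imu_ts: (N,) IMU timestamps; imu_data: (N, 6).
    Returns a list of (M-1) arrays, one IMU window per consecutive image pair."""
    windows = []
    for t0, t1 in zip(img_ts[:-1], img_ts[1:]):
        mask = (imu_ts >= t0) & (imu_ts < t1)
        windows.append(imu_data[mask])
    return windows

# With a 10 Hz camera and a 100 Hz IMU, each window holds roughly 10 samples.
```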
3) inputting the training set into a deep learning network of the self-supervision visual inertial navigation odometer for training to obtain a well-trained deep learning network of the self-supervision visual inertial navigation odometer;
as shown in fig. 1, the self-supervision visual inertial navigation odometer deep learning network in the step 3) is composed of a target image frame extraction module, a depth estimation network and a pose estimation network, wherein the target image frame extraction module is respectively connected with the depth estimation network and the pose estimation network;
the image preprocessing data in the training set is input into a target image frame extraction module, the target image frame extraction module respectively outputs an original target image frame and a source image frame, the original target image frame is input into a depth estimation network, the depth estimation network outputs a depth map, and the original target image frame, the source image frame, the aligned original inertial navigation sensor data in the training set and the corresponding inertial navigation sensor pre-integration data are input into a pose estimation network.
The pose estimation network comprises a convolutional neural network, a depth self-attention transformation network (Transformer network), a first multilayer perceptron and a second multilayer perceptron;
The original target image frame and the source image frame are input into the convolutional neural network, the aligned raw inertial navigation sensor data into the deep self-attention transformation network, and the aligned inertial navigation sensor pre-integration data into the first multilayer perceptron; the outputs of the three branches are feature-fused and input into the second multilayer perceptron. The pose estimation network thus performs feature learning and parametric neural-network representation on the original target image frame, the source image frame, the time-sequential inertial navigation sensor data, and its pre-integration data through the convolutional neural network, the deep self-attention transformation network, and the first multilayer perceptron, respectively. Specifically, the convolutional neural network performs navigation-parameter feature learning on the original target image frame and the source image frame to obtain an N-dimensional visual feature vector; the deep self-attention transformation network performs navigation-parameter feature learning on the time-sequential raw inertial navigation sensor data to obtain an N-dimensional raw-inertial-data feature vector; and the first multilayer perceptron parameterizes the inertial navigation sensor pre-integration data, which is a relative 6-degree-of-freedom pose between two time instants, so the 6-dimensional pre-integration data can be directly expanded into an N-dimensional representation by a B-layer multilayer perceptron, giving an N-dimensional pre-integration feature vector. The three feature vectors are concatenated into a total feature vector, and the second multilayer perceptron performs a regression that maps the total feature vector to the 6-degree-of-freedom pose parameters, namely a three-dimensional translation and a three-dimensional rotation;
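The fusion structure described above can be sketched in PyTorch as follows. This is an illustrative approximation, not the patented architecture: the feature dimension N (here feat_dim), the small CNN, the use of torch.nn.TransformerEncoder as a stand-in for the position-encoded six-block encoder, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PoseFusionHead(nn.Module):
    def __init__(self, feat_dim=256, imu_len=10):
        super().__init__()
        # CNN branch: target + source frames concatenated on channels (6 x H x W).
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
        # Transformer branch: raw IMU sequence (imu_len x 6 channels).
        enc_layer = nn.TransformerEncoderLayer(d_model=6, nhead=2, batch_first=True)
        self.imu_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.imu_proj = nn.Linear(imu_len * 6, feat_dim)
        # First MLP: 6-D pre-integration -> feat_dim.
        self.preint_mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        # Second MLP: fused features -> 6-DoF relative pose (3 translation + 3 rotation).
        self.pose_mlp = nn.Sequential(nn.Linear(3 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 6))

    def forward(self, frames, imu_raw, imu_preint):
        f_img = self.cnn(frames)                                     # (B, feat_dim)
        f_imu = self.imu_proj(self.imu_encoder(imu_raw).flatten(1))  # (B, feat_dim)
        f_pre = self.preint_mlp(imu_preint)                          # (B, feat_dim)
        fused = torch.cat([f_img, f_imu, f_pre], dim=1)              # feature fusion by concatenation
        return self.pose_mlp(fused)                                  # (B, 6) relative pose

pose = PoseFusionHead()(torch.randn(2, 6, 192, 640), torch.randn(2, 10, 6), torch.randn(2, 6))
```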
source image frame F output by target image frame extraction module based on self-supervision visual inertial navigation odometer deep learning networksObtaining a reconstructed target image frame F 'after a relative pose of 6 degrees of freedom output by a pose estimation network, a depth map output by a depth estimation network and a camera internal reference matrix in a monocular vision inertial navigation odometer are subjected to view reconstruction by using a view reconstruction method'tThe concrete formula is as follows:
Figure BDA0003470437240000061
wherein, F'tRepresenting a reconstructed target image frame, K is an internal reference matrix of the camera,
Figure BDA0003470437240000062
the relative pose in 6 degrees of freedom of the network output is estimated for pose,
Figure BDA0003470437240000063
depth map p output for computing depth estimation networktDepth of each pixel point in, K-1Representing the inverse of the camera's intrinsic reference matrix, and using the bilinear sampling method since the pixel values calculated by this formula are continuousCarrying out bilinear interpolation on the reconstructed target image frame by the method, obtaining a reconstructed pixel value from surrounding pixels according to a proportion, and obtaining a reconstructed target image frame F'tTherefore, the error between the reconstructed target image frame and the original target image frame is optimized to restrict the training process of the whole network model. And finally, based on the reconstructed target image frame F'tAnd the original target image frame FtAnd calculating a total loss function by the original target image frame output by the extraction module, and training the self-supervision visual inertial navigation odometer deep learning network according to the total loss function. The total loss function includes a view reconstruction loss function and an edge perception smoothness loss function. The learning process of the self-supervision visual inertial navigation odometer deep learning network does not need any truth value participation, the total loss function is utilized to simultaneously constrain the deep estimation network and the pose estimation network, and the training of the whole network is completed. And coupling the two networks to complete the self-supervision training process by using the total loss function.
The image preprocessing data input to the target image frame extraction module is an image sequence, each sequence containing at least two images. When a sequence contains 2 images, the earlier frame is taken as the source image frame F_s and the later frame as the original target image frame F_t; when a sequence contains an odd number of images, the middle frame is taken as the original target image frame F_t and the remaining frames together form the source image frames F_s.
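The frame-splitting rule above can be illustrated with a short sketch; the helper name split_sequence is an assumption.

```python
def split_sequence(frames):
    """frames: list of images ordered in time. Returns (target_frame, source_frames)."""
    if len(frames) == 2:
        return frames[1], [frames[0]]            # later frame is the target, earlier frame is the source
    mid = len(frames) // 2                       # odd-length sequence: middle frame is the target
    return frames[mid], frames[:mid] + frames[mid + 1:]
```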
As shown in fig. 3, the depth self-attention transformation network includes a position coding module, 6 coding modules and a third multi-layer perceptron, the aligned original inertial navigation sensor data is input into the position coding module, the output of the position coding module is added with the aligned original inertial navigation sensor data and then input into a first coding module, the first coding module is connected with a sixth coding module after passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module in sequence, the output of the sixth coding module is input into the third multi-layer perceptron, and the output of the third multi-layer perceptron is used as the output of the depth self-attention transformation network; the 6 coding modules have the same structure and comprise a multi-head attention mechanism module, a first residual error module, a first normalization layer, a feedforward neural network, a second residual error module and a second normalization layer; in each coding module, the input of the current coding module is respectively input into a multi-head attention mechanism module and a first residual error module of the current coding module, the output of the multi-head attention mechanism module and the output of the first residual error module are added and then input into a first normalization layer, the output of the first normalization layer is respectively input into a feedforward neural network and a second residual error module, the output of the feedforward neural network and the output of the second residual error module are added and then input into a second normalization layer, and the output of the second normalization layer is used as the output of the current coding module.
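One encoder module of the deep self-attention transformation network, as described above, can be sketched as follows (post-norm form: attention plus residual plus LayerNorm, then feed-forward plus residual plus LayerNorm). The layer sizes (d_model, number of heads, feed-forward width) are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=6, n_heads=2, d_ff=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (B, seq_len, d_model)
        a, _ = self.attn(x, x, x)            # multi-head self-attention
        x = self.norm1(x + a)                # first residual path + first normalization layer
        x = self.norm2(x + self.ffn(x))      # feed-forward, second residual path + second normalization layer
        return x

# Six stacked blocks follow the positional encoding; a final MLP (not shown) produces the branch output.
encoder = torch.nn.Sequential(*[EncoderBlock() for _ in range(6)])
out = encoder(torch.randn(2, 10, 6))         # e.g. 10 IMU samples with 6 channels each
```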
In a specific implementation, to enhance the expressive ability of the network and prevent degradation as the network deepens, the encoder of the depth estimation network adopts a standard ResNet18. The depth estimation network takes only the original target image frame as input, while the original target image frame and the source image frame, concatenated along the channel dimension, serve as the input of the convolutional-neural-network part of the pose estimation network.
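A hedged sketch of this input arrangement, assuming torchvision's standard ResNet18 as the depth encoder and channel-wise concatenation for the pose branch; the decoder that upsamples the encoder features into a depth map is omitted.

```python
import torch
import torchvision

encoder = torchvision.models.resnet18()                           # standard ResNet18, randomly initialized
backbone = torch.nn.Sequential(*list(encoder.children())[:-2])    # drop avgpool/fc, keep the feature maps

target = torch.randn(1, 3, 192, 640)
source = torch.randn(1, 3, 192, 640)

feat = backbone(target)                        # (1, 512, 6, 20); a decoder (not shown) turns this into a depth map
pose_cnn_input = torch.cat([target, source], dim=1)  # (1, 6, 192, 640): channel-wise input for the pose CNN
```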
Since the raw inertial navigation sensor data has a dimension of 6, the input node count of the deep self-attention transformation network is set to 6.
In order to make the output of the attention layer contain richer coded representation information and enhance the expressive power of the model, the number of the attention heads of the multi-head attention mechanism module is set as n.
Through self-supervision training, the depth network and the pose estimation network can respectively output a depth map of a scene and a relative pose between two frames of images.
4) The method comprises the steps of collecting original image data and original inertial navigation sensor data of a target scene, performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, processing the original image data and the original inertial navigation sensor data of the target scene and the corresponding inertial navigation sensor pre-integration data respectively to obtain normalized original inertial navigation sensor data, inertial navigation sensor pre-integration data and image pre-processing data of the target scene, inputting the normalized original inertial navigation sensor data, the normalized inertial navigation sensor pre-integration data and the image pre-processing data into a trained self-supervision visual inertial navigation odometer deep learning network, outputting a predicted relative pose of 6 degrees of freedom, and performing absolute scale processing on the predicted relative pose of 6 degrees of freedom by using a least square method based on the inertial navigation sensor pre-processing data of the target scene to obtain an absolute pose of 6 degrees of freedom of a camera.
The absolute 6-degree-of-freedom pose of the camera in step 4) is obtained by multiplying the 6-degree-of-freedom relative pose of the camera by the absolute scale of the translation, where the least-squares method optimizes the error between the predicted translation X_{i} and the aligned inertial navigation sensor pre-integration data Y_{i} to obtain the absolute scale of the translation. The absolute scale of the translation is computed as:
Scale^{*} = \arg\min_{\mathrm{Scale}} \left\| Y_{i} - \mathrm{Scale} \cdot X_{i} \right\|^{2}
wherein Scale^{*} denotes the final absolute scale of the translation, the 6-degree-of-freedom relative pose comprises the translation X_{i} and the rotation, Y_{i} denotes the aligned inertial navigation sensor pre-integration data, \arg\min\|\cdot\|^{2} denotes the least-squares operation, and Scale denotes the absolute scale of the translation during the calculation.
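The least-squares scale recovery in the formula above has a simple closed form: minimizing ||Y_i − Scale·X_i||² over Scale gives Scale = (Σ Y_i·X_i)/(Σ X_i·X_i). The sketch below is illustrative; the function name and the use of stacked translation vectors are assumptions.

```python
import numpy as np

def recover_scale(pred_translations, preint_translations):
    """pred_translations, preint_translations: (N, 3) arrays of aligned translation vectors
    (network prediction X_i and IMU pre-integration Y_i). Returns the least-squares scale."""
    X = np.asarray(pred_translations).ravel()
    Y = np.asarray(preint_translations).ravel()
    return float(np.dot(Y, X) / np.dot(X, X))

scale = recover_scale(np.random.rand(50, 3), np.random.rand(50, 3))
# The absolute 6-DoF pose is then obtained by applying this scale to the predicted translation.
```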
Testing was carried out on the KITTI dataset; the pose estimation errors are detailed in Table 1 below. Method 1 (SC-SfMLearner), Method 2 (Zhou et al.) and Method 3 (Zhan et al.) are currently high-performing learning-based purely visual odometry methods.
TABLE 1 pose estimation error tested on KITTI data set
(The numerical data of Table 1 are provided as an image in the original publication.)
It can be seen that with the proposed method, both the translation error and the rotation error of the model are clearly improved on the KITTI 09 sequence; on the KITTI 10 sequence, the rotation error is significantly improved. Note that the performance of Method 1 is obtained with the absolute scale recovered from ground truth, whereas the proposed method uses no ground truth at all and relies only on IMU data to recover the absolute scale of the pose.
Table 2 below gives the depth estimation errors tested on the KITTI data. Method 4 (Monodepth2), Method 1 (SC-SfMLearner), Method 5 (D3VO) and Method 6 (PackNet-SfM) are all currently high-performing learning-based purely visual odometry methods in terms of depth estimation. In the table, ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
TABLE 2 depth estimation error tested on KITTI data
(The numerical data of Table 2 are provided as an image in the original publication.)
Compared with these currently high-performing learning-based purely visual odometry methods, the proposed method effectively improves every depth estimation metric.

Claims (8)

1. A monocular vision inertial navigation positioning method based on self-supervision deep learning is characterized by comprising the following steps:
1) respectively acquiring original image data and original inertial navigation sensor data of each scene by using a camera and inertial navigation in the monocular vision inertial navigation odometer, and preprocessing the original image data of the scenes to obtain image preprocessing data; performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, performing normalization processing on the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integration data of the current scene respectively, and obtaining the normalized original inertial navigation sensor data and the normalized inertial navigation sensor pre-integration data respectively;
2) respectively aligning the normalized original inertial navigation sensor data and inertial navigation sensor pre-integral data of the current scene with image preprocessing data according to a timestamp to respectively obtain the aligned original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data, and forming a training set by the image preprocessing data of a plurality of scenes, the aligned original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data;
3) inputting the training set into a deep learning network of the self-supervision visual inertial navigation odometer for training to obtain a well-trained deep learning network of the self-supervision visual inertial navigation odometer;
4) the method comprises the steps of collecting original image data and original inertial navigation sensor data of a target scene, performing pre-integration on the original inertial navigation sensor data to obtain inertial navigation sensor pre-integration data, processing the original image data and the original inertial navigation sensor data of the target scene and the corresponding inertial navigation sensor pre-integration data respectively to obtain normalized original inertial navigation sensor data, inertial navigation sensor pre-integration data and image pre-processing data of the target scene, inputting the normalized original inertial navigation sensor data, the normalized inertial navigation sensor pre-integration data and the image pre-processing data into a trained self-supervision visual inertial navigation odometer deep learning network, outputting a predicted relative pose of 6 degrees of freedom, and performing absolute scale processing on the predicted relative pose of 6 degrees of freedom by using a least square method based on the inertial navigation sensor pre-processing data of the target scene to obtain an absolute pose of 6 degrees of freedom of a camera.
2. The method for positioning monocular vision inertial navigation based on self-supervision deep learning according to claim 1, wherein in step 1), the image size of the original image data of the scene is clipped to a preset size, and image preprocessing data is obtained; and then respectively carrying out zero-mean normalization processing on the original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data of the scene to respectively obtain the normalized original inertial navigation sensor data and the corresponding inertial navigation sensor pre-integral data.
3. The monocular vision inertial navigation positioning method based on the self-supervision deep learning of claim 1, wherein the self-supervision vision inertial navigation odometer deep learning network in the step 3) is composed of a target image frame extraction module, a depth estimation network and a pose estimation network, and the target image frame extraction module is respectively connected with the depth estimation network and the pose estimation network;
the image preprocessing data in the training set is input into a target image frame extraction module, the target image frame extraction module respectively outputs an original target image frame and a source image frame, the original target image frame is input into a depth estimation network, and the original target image frame, the source image frame, the aligned original inertial navigation sensor data in the training set and the corresponding inertial navigation sensor pre-integration data are input into a pose estimation network.
4. The method for monocular vision inertial navigation positioning based on self-supervision deep learning according to claim 3, wherein the pose estimation network comprises a convolutional neural network, a deep self-attention transformation network, a first multi-layer perceptron and a second multi-layer perceptron;
the method comprises the steps that an original target image frame and a source image frame are input into a convolutional neural network, aligned original inertial navigation sensor data are input into a depth self-attention transformation network, aligned inertial navigation sensor preprocessing data are input into a first multilayer perceptron, the outputs of the convolutional neural network, the depth self-attention transformation network and the first multilayer perceptron are subjected to feature fusion and then input into a second multilayer perceptron, and the output of the second multilayer perceptron is used as the output of a pose estimation network.
5. The monocular vision inertial navigation positioning method based on self-supervision deep learning according to claim 3, wherein the self-supervision visual inertial navigation odometer deep learning network carries out view reconstruction by using a view reconstruction method based on the source image frame output by the target image frame extraction module, the relative pose with 6 degrees of freedom output by the pose estimation network, the depth map output by the depth estimation network and the internal parameter matrix of the camera in the monocular visual inertial navigation odometer, obtaining a reconstructed target image frame; bilinear interpolation is carried out on the reconstructed target image frame by using a bilinear sampling method to obtain a predicted target image frame; and finally, a total loss function is calculated based on the predicted target image frame and the original target image frame output by the target image frame extraction module, and the self-supervision visual inertial navigation odometer deep learning network is trained according to the total loss function.
6. The monocular vision inertial navigation positioning method based on the self-supervision deep learning according to claim 3, characterized in that the image preprocessing data inputted in the target image frame extraction module is an image sequence, each image sequence comprises at least two images, when the number of the images of each image sequence is 2, the image of the previous frame is used as a source image frame, and the image of the next frame is used as an original target image frame; when the number of the images of each image sequence is odd, the images of the intermediate frames are used as original target image frames, and the images of the rest frames are used as source image frames together.
7. The method for positioning monocular vision inertial navigation based on self-supervision deep learning according to claim 4, wherein the depth self-attention transformation network comprises a position coding module, 6 coding modules and a third multilayer perceptron, the aligned original inertial navigation sensor data is input into the position coding module, the output of the position coding module is added with the aligned original inertial navigation sensor data and then input into the first coding module, the first coding module is connected with a sixth coding module after passing through the second coding module, the third coding module, the fourth coding module and the fifth coding module in sequence, the output of the sixth coding module is input into the third multilayer perceptron, and the output of the third multilayer perceptron is used as the output of the depth self-attention transformation network; the 6 coding modules have the same structure and comprise a multi-head attention mechanism module, a first residual error module, a first normalization layer, a feedforward neural network, a second residual error module and a second normalization layer; in each coding module, the input of the current coding module is respectively input into a multi-head attention mechanism module and a first residual module of the current coding module, the output of the multi-head attention mechanism module and the output of the first residual module are added and then input into a first normalization layer, the output of the first normalization layer is respectively input into a feedforward neural network and a second residual module, the output of the feedforward neural network and the output of the second residual module are added and then input into a second normalization layer, and the output of the second normalization layer is used as the output of the current coding module.
8. The method for monocular vision inertial navigation positioning based on self-supervision deep learning according to claim 1, wherein the absolute pose of 6 degrees of freedom of the camera in step 4) is obtained by multiplying the absolute scale of the translation amount by the relative pose of 6 degrees of freedom of the camera, wherein the calculation formula of the absolute scale of the translation amount is as follows:
Scale*=arg min||Yi-Scale*Xi||2
wherein, Scale*Representing the absolute scale of the final translation, the relative pose of 6 degrees of freedom includes the translation XiAnd amount of rotation, YiRepresenting the pre-integration data of the aligned inertial navigation sensor, argmin | | | | | non-woven cells2Represents the least squares operation and Scale represents the absolute Scale of the amount of translation during the calculation.
CN202210041486.8A 2022-01-14 2022-01-14 Monocular vision inertial navigation positioning method based on self-supervision deep learning Active CN114526728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210041486.8A CN114526728B (en) 2022-01-14 2022-01-14 Monocular vision inertial navigation positioning method based on self-supervision deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210041486.8A CN114526728B (en) 2022-01-14 2022-01-14 Monocular vision inertial navigation positioning method based on self-supervision deep learning

Publications (2)

Publication Number Publication Date
CN114526728A true CN114526728A (en) 2022-05-24
CN114526728B CN114526728B (en) 2023-12-05

Family

ID=81620126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210041486.8A Active CN114526728B (en) 2022-01-14 2022-01-14 Monocular vision inertial navigation positioning method based on self-supervision deep learning

Country Status (1)

Country Link
CN (1) CN114526728B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103292804A (en) * 2013-05-27 2013-09-11 浙江大学 Monocular natural vision landmark assisted mobile robot positioning method
CN108986166A (en) * 2018-07-20 2018-12-11 山东大学 A kind of monocular vision mileage prediction technique and odometer based on semi-supervised learning
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN110595466A (en) * 2019-09-18 2019-12-20 电子科技大学 Lightweight inertial-assisted visual odometer implementation method based on deep learning
CN110717927A (en) * 2019-10-10 2020-01-21 桂林电子科技大学 Indoor robot motion estimation method based on deep learning and visual inertial fusion
US20210118184A1 (en) * 2019-10-17 2021-04-22 Toyota Research Institute, Inc. Systems and methods for self-supervised scale-aware training of a model for monocular depth estimation
CN111325797A (en) * 2020-03-03 2020-06-23 华东理工大学 Pose estimation method based on self-supervision learning
CN111311685A (en) * 2020-05-12 2020-06-19 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU/monocular image
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
熊炜; 金靖熠; 王娟; 刘敏; 曾春艳: "Monocular Visual Odometry Based on the Deep Learning Feature Point Method", Computer Engineering & Science, no. 01 *
苏健鹏; 黄影平; 赵柏淦; 胡兴: "Research on Visual Odometry Based on Deep Convolutional Neural Networks", Optical Instruments, no. 04 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114993306A (en) * 2022-08-04 2022-09-02 中国人民解放军国防科技大学 Scale self-recovery visual inertial integrated navigation method and device
CN114993306B (en) * 2022-08-04 2022-10-28 中国人民解放军国防科技大学 Scale self-recovery visual inertial integrated navigation method and device
CN116681759A (en) * 2023-04-19 2023-09-01 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences Camera pose estimation method based on self-supervision visual inertial odometer
CN116681759B (en) * 2023-04-19 2024-02-23 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences Camera pose estimation method based on self-supervision visual inertial odometer

Also Published As

Publication number Publication date
CN114526728B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
CN107516326B (en) Robot positioning method and system fusing monocular vision and encoder information
CN107481292A (en) The attitude error method of estimation and device of vehicle-mounted camera
CN114526728B (en) Monocular vision inertial navigation positioning method based on self-supervision deep learning
CN106780631B (en) Robot closed-loop detection method based on deep learning
CN109242003B (en) Vehicle-mounted vision system self-motion determination method based on deep convolutional neural network
CN110533724B (en) Computing method of monocular vision odometer based on deep learning and attention mechanism
CN112556719B (en) Visual inertial odometer implementation method based on CNN-EKF
CN108648274A (en) A kind of cognition point cloud map creation system of vision SLAM
CN113392904B (en) LTC-DNN-based visual inertial navigation combined navigation system and self-learning method
CN114332394A (en) Semantic information assistance-based dynamic scene three-dimensional reconstruction method
CN114964276A (en) Dynamic vision SLAM method fusing inertial navigation
CN113313176A (en) Point cloud analysis method based on dynamic graph convolution neural network
CN117058474B (en) Depth estimation method and system based on multi-sensor fusion
Gao et al. Gyro-net: IMU gyroscopes random errors compensation method based on deep learning
CN112348854A (en) Visual inertial mileage detection method based on deep learning
CN116824433A (en) Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network
CN116704032A (en) Outdoor visual SLAM method based on monocular depth estimation network and GPS
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN113379787B (en) Target tracking method based on 3D convolution twin neural network and template updating
CN113848884B (en) Unmanned engineering machinery decision method based on feature fusion and space-time constraint
CN115100237A (en) Visual odometer method for inspection robot
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN114743105A (en) Depth privilege visual odometer method based on cross-modal knowledge distillation
CN113160391A (en) Double-stage three-dimensional scene modeling method
CN114034312B (en) Light-weight multi-decoupling visual odometer implementation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant