CN114485637A - Visual and inertial mixed pose tracking method of head-mounted augmented reality system

Info

Publication number
CN114485637A
CN114485637A (application CN202210055655.3A)
Authority
CN
China
Prior art keywords
pose
visual
augmented reality
inertial measurement
head
Prior art date
Legal status
Pending
Application number
CN202210055655.3A
Other languages
Chinese (zh)
Inventor
何宁
晁建刚
许振瑛
胡福超
黄冉
佘佳宏
郭俊鹏
陈炜
Current Assignee
63919 Troops of PLA
Original Assignee
63919 Troops of PLA
Priority date
Filing date
Publication date
Application filed by 63919 Troops of PLA
Priority claimed from CN202210055655.3A
Publication of CN114485637A
Legal status: Pending

Classifications

    • G01C 21/165 Dead reckoning by integrating acceleration or speed (inertial navigation) combined with non-inertial navigation instruments
    • G01C 21/18 Stabilised platforms, e.g. by gyroscope
    • G06F 17/13 Complex mathematical operations: differential equations
    • G06F 17/16 Complex mathematical operations: matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 18/23213 Non-hierarchical clustering with a fixed number of clusters, e.g. K-means clustering
    • G06T 17/005 Three-dimensional modelling: tree description, e.g. octree, quadtree
    • G06T 7/20 Image analysis: analysis of motion
    • G06T 7/70 Image analysis: determining position or orientation of objects or cameras


Abstract

The invention discloses a visual and inertial mixed pose tracking method for a head-mounted augmented reality system. The method first performs device initialization with measurement preprocessing of the vision sensor and the inertial measurement unit: the frame images acquired by the vision sensor are processed, the inertial measurements corresponding to two consecutive frame images are pre-integrated, and the vision sensor and the inertial measurement unit are jointly calibrated. Using the measurement data acquired by the vision sensor and the inertial measurement unit, the pre-integrated inertial measurements, the visual image feature observations and the loop detection results are fused to realize fusion positioning of the head-mounted augmented reality system; loop detection is performed on the positioning result, and global pose optimization is carried out with the relocation result to obtain the final pose estimation result. The invention improves the intelligence and usability of augmented reality operation assistance for astronauts and effectively improves astronaut training efficiency.

Description

Visual and inertial mixed pose tracking method of head-mounted augmented reality system
Technical Field
The invention relates to the field of manned space flight, in particular to a visual and inertial mixed pose tracking method of a head-mounted augmented reality system.
Background
During astronaut training, tracking the head motion of the astronaut, and from it the gaze direction, is of great significance for evaluating the astronaut's operational behavior and actions. To reproduce an aerospace operation scene during training, the astronaut generally wears a head-mounted augmented reality system. Estimating the pose of this head-mounted augmented reality system and obtaining its pose data can effectively support the training of astronauts on operating equipment and provide a basis for evaluating the training effect. Among existing methods for real-time estimation of equipment pose, optical methods are easily occluded and their performance is limited in poor lighting, while a visual-inertial system cannot directly obtain the accurate state of the system because of sensor limitations, so processes such as initialization, extrinsic calibration and nonlinear optimization of the system state estimator are difficult to carry out in a static state. The inertial measurement unit (IMU) has a high output frequency and can output the pose state over short periods, but its pose estimation accuracy depends on the calibration process; if the IMU cannot be effectively calibrated, the estimation accuracy is affected.
Disclosure of Invention
The invention discloses a visual and inertial mixed pose tracking method of a head-mounted augmented reality system, aiming at the pose tracking problem of the head-mounted augmented reality system worn by astronauts in the training process.
The vision and inertia hybrid tracking method uses a vision sensor and an inertial measurement unit (IMU) as the minimum sensor combination and estimates the six-degree-of-freedom (DoF) state of the head-mounted augmented reality system by forming a visual-inertial system. Because of sensor limitations, the visual-inertial system cannot directly obtain the accurate state of the system, so processes such as initialization, extrinsic calibration and nonlinear optimization of the system state estimator are difficult to carry out in a static state. Six-degree-of-freedom state estimation of the head-mounted augmented reality system is realized by fusing visual feature observations with pre-integrated IMU measurements: the system first runs the estimator initialization and failure-recovery process; after initialization it starts monocular VIO to obtain accurate visual-inertial estimates; loop detection is used to quickly compute relocation; and a four-degree-of-freedom pose graph optimization is used to improve the global consistency of the pose estimates. Finally, the performance improvement of the system estimate is evaluated by comparison with visual-only SLAM.
The invention discloses a visual and inertial mixed pose tracking method of a head-mounted augmented reality system, which comprises the steps of firstly carrying out equipment initialization, including measurement preprocessing on a visual sensor and an inertial measurement device, carrying out combined calibration on the visual sensor and the inertial measurement device, then carrying out fusion positioning on the head-mounted augmented reality system by using measurement data acquired by the visual sensor and the inertial measurement device, and carrying out loop detection and global pose graph optimization on a positioning result to obtain a final pose estimation result.
The measurement preprocessing process comprises processing frame images acquired by a vision sensor and performing pre-integration on inertial measurement results corresponding to two continuous frame images;
the method comprises the steps of carrying out combined calibration on a vision sensor and an inertia measurement device, calibrating the inertia measurement device to obtain an initial estimation result of a target attitude, a speed, a gravity vector, a gyroscope bias and a three-dimensional characteristic position, guiding a Visual-Inertial odometer VIO (VIO) based on nonlinear optimization, and calculating an absolute scale factor, the gyroscope bias, an acceleration bias, a gravity acceleration and a relative speed of the inertia measurement device to the vision sensor at each measurement moment.
The fusion positioning process fuses the pre-integrated inertial measurements with the visual image feature observations to realize fusion positioning of the head-mounted augmented reality system.
The global pose graph optimization performs global optimization using the relocation result in order to eliminate drift.
In the equipment initialization process, firstly, a visual SFM algorithm based on a sliding window is initialized, then the pose change of the head-mounted augmented reality system is calculated, and the IMU pre-integration result is matched with the visual SFM algorithm result to restore the scale, gravity, speed and deviation so as to guide the VIO based on nonlinear optimization.
The initialization of the sliding-window-based visual SFM algorithm includes feature point tracking, camera pose estimation and optimization. For feature point tracking, the feature correspondences between the latest frame image acquired by the head-mounted augmented reality system and all frame images in the sliding window are checked. If the latest frame image and any frame image in the sliding window share more than a certain number of tracked features and enough rotation-compensated parallax (in pixels), the relative rotation and scaled translation of the latest frame image are computed from those tracked features and all features in the two frames are triangulated; otherwise the latest frame image is stored in the sliding window and a new frame image is awaited. For camera pose estimation, the poses of all frame images in the sliding window are estimated from the triangulated features with the PnP method, giving the camera poses. The optimization is realized by minimizing the total reprojection error of all feature observations with a BA (bundle adjustment) optimization method, completing the optimized estimate of the camera attitude.
Let P be the position in real space of the object corresponding to the feature points of the left and right frame images acquired by the left-eye and right-eye cameras respectively, called the feature position, and let the feature points p1 and p2 be the projections of that object onto the left-eye and right-eye cameras, whose centers are O1 and O2. Writing P = [X, Y, Z]^T, the pinhole camera model gives for the feature points p1, p2 the relations:
s1 p1 = K1 P
s2 p2 = K2 (R P + t),
where K1 and K2 are the intrinsic matrices of the left-eye and right-eye cameras, R and t are the rotation matrix and translation vector between the left-eye and right-eye cameras, and s1, s2 are the scale factors of the left-eye and right-eye cameras. Stacking these projection constraints gives a linear system in P whose least-squares solution yields the feature position estimate.
From the feature point matching information and the feature position estimates, the SFM algorithm obtains the three-dimensional spatial position of each current feature point and its projected position in the camera, and the image pose is then estimated directly with the PnP algorithm.
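As an illustration of the triangulation step above, the following is a minimal sketch (not part of the patent text) of recovering the feature position P from the two pinhole projection constraints by a linear least-squares solve; the intrinsics K1, K2 and the stereo extrinsics R, t are assumed known from calibration, and the numeric values in the usage example are made up.

```python
import numpy as np

def triangulate_point(p1, p2, K1, K2, R, t):
    """Linear least-squares triangulation of one feature from a stereo pair.

    p1, p2 : pixel coordinates (u, v) of the feature in the left/right images.
    K1, K2 : 3x3 intrinsic matrices of the left/right cameras.
    R, t   : rotation (3x3) and translation (3,) from the left to the right camera.
    Returns the 3D feature position P in the left-camera frame.
    """
    # Projection matrices for the two views (left camera is the reference frame).
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])   # 3x4
    P2 = K2 @ np.hstack([R, t.reshape(3, 1)])            # 3x4

    # Each view contributes two linear constraints on the homogeneous point.
    u1, v1 = p1
    u2, v2 = p2
    A = np.vstack([
        u1 * P1[2] - P1[0],
        v1 * P1[2] - P1[1],
        u2 * P2[2] - P2[0],
        v2 * P2[2] - P2[1],
    ])

    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Example usage with made-up values.
if __name__ == "__main__":
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    R = np.eye(3)
    t = np.array([-0.1, 0.0, 0.0])                  # assumed 10 cm stereo baseline
    P_true = np.array([0.2, -0.1, 2.0])
    p1 = (K @ P_true)[:2] / P_true[2]
    p2h = K @ (R @ P_true + t)
    p2 = p2h[:2] / p2h[2]
    print(triangulate_point(p1, p2, K, K, R, t))    # ~ [0.2, -0.1, 2.0]
```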
The calibration of the inertial measurement unit includes calibrating the gyroscope bias, refining the gravity vector, and calibrating the velocity, gravity vector and scale. For the gyroscope bias calibration, consider two consecutive frame images bk and bk+1 in the image window, for which the visual SFM algorithm gives the corresponding rotations q̂^c0_bk and q̂^c0_bk+1; denote the gyroscope bias correction between the time of frame image bk+1 and the time of frame image bk by δbω, and denote the pre-integrated rotation of the corresponding inertial measurements by γ̂^bk_bk+1. Linearizing the pre-integration term with respect to the gyroscope bias of the inertial measurement unit, the following minimization loss function is constructed:
min over δbω of Σ_{k∈B} ‖ (q̂^c0_bk+1)^(-1) ⊗ q̂^c0_bk ⊗ γ̂^bk_bk+1 ‖²,
γ̂^bk_bk+1 ≈ γ̂^bk_bk+1 ⊗ [1, 1/2·J^γ_bω·δbω]^T,
where B is the set of all frame image sequence numbers in the image window and J^γ_bω is the bias Jacobian of the pre-integrated rotation with respect to the gyroscope bias. Using this bias Jacobian to form a first-order approximation in the gyroscope bias, solving the minimization loss function yields the initial calibration of the gyroscope bias bω.
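For illustration only, the sketch below shows how such a gyroscope-bias correction could be solved as a linear least-squares problem once the per-frame visual rotations, the pre-integrated IMU rotations and their bias Jacobians are available; it is not taken from the patent, the frame conventions are assumed, and scipy's Rotation class is used for the quaternion algebra.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def calibrate_gyro_bias(q_vis, gamma_imu, J_bw):
    """Estimate a gyroscope bias correction delta_bw from rotation mismatches.

    q_vis     : list of visual rotations of each frame w.r.t. the reference frame c0
                (scipy Rotation objects), one per frame in the window.
    gamma_imu : list of pre-integrated IMU rotations between consecutive frames.
    J_bw      : list of 3x3 Jacobians of each pre-integrated rotation w.r.t. the bias.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for k in range(len(gamma_imu)):
        # Rotation mismatch between vision and pre-integrated IMU for frames k, k+1.
        err = gamma_imu[k].inv() * q_vis[k].inv() * q_vis[k + 1]
        # Small-angle residual: twice the vector part of the error quaternion.
        res = 2.0 * err.as_quat()[:3]          # scipy quaternions are (x, y, z, w)
        A += J_bw[k].T @ J_bw[k]
        b += J_bw[k].T @ res
    return np.linalg.solve(A, b)               # delta_bw, added to the current bias
```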
After the gyroscope bias is initialized, the velocity, the gravity vector and the actual scale are computed; the quantities to be calculated are collected in the vector
X_I = [v_b0, v_b1, …, v_bn, g^c0, s]^T,
where v_bk is the velocity of the target expressed in the body coordinate system at the time the k-th frame image is acquired, g^c0 is the gravity vector expressed in the frame of the c0-th frame image, and s is the scale factor that aligns the visual SFM result to metric units.
For the joint calibration of the vision sensor and the inertial measurement unit, let C_obj denote the object coordinate system, which is taken as the world coordinate system; C_c the vision sensor coordinate system; and C_i the inertial measurement unit coordinate system. The relative pose between C_c and C_obj can be obtained by a vision sensor calibration method; here R_a and t_a denote the attitude and the three translation components of the vision sensor relative to the world coordinate system. Through the relative position t_x between C_i and C_c, the relative pose R_x between C_i and C_obj is obtained. The vision sensor and the inertial measurement unit are calibrated by observing a known calibration reference in space from different positions, and the relationship between R_x and t_x is derived from the results of several observations.
The joint calibration of the vision sensor and the inertial measurement unit requires multi-dimensional parameter information from the inertial measurement unit, obtained as follows. First, the 3D information is read directly: the set of target attitude angles represented by three Euler angles, together with the raw angular velocity and linear acceleration, is obtained directly from the inertial measurement unit, giving the required target attitude at each measurement time. Then, assuming the target initial position is zero, i.e. the initial position s0 = 0, the target position s1 reached after a time Δt at initial velocity v is estimated by the linear formula s1 = vΔt + 1/2·aΔt², where a is the target acceleration. The multi-dimensional parameter information thus comprises the target attitude angles, the raw angular velocity, the linear acceleration, the target attitude over time, and the target position.
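As a small illustration (not from the patent) of the linear position update s1 = vΔt + 1/2·aΔt² used above, the sketch below propagates a position estimate from IMU-style acceleration readings; the sample rate and readings in the example are assumed values.

```python
import numpy as np

def propagate_position(s0, v0, accels, dt):
    """Dead-reckon position from an initial state and a sequence of accelerations.

    s0, v0 : initial position and velocity (3-vectors); the patent assumes s0 = 0.
    accels : iterable of linear-acceleration samples a (3-vectors), gravity removed.
    dt     : sampling interval in seconds.
    """
    s, v = np.asarray(s0, float), np.asarray(v0, float)
    for a in accels:
        a = np.asarray(a, float)
        s = s + v * dt + 0.5 * a * dt**2   # s1 = v*dt + 1/2*a*dt^2 per step
        v = v + a * dt
    return s, v

# Example: constant 0.1 m/s^2 acceleration along x for one second at 100 Hz.
s1, v1 = propagate_position([0, 0, 0], [0, 0, 0], [[0.1, 0, 0]] * 100, 0.01)
print(s1, v1)   # roughly [0.05, 0, 0] and [0.1, 0, 0]
```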
The head-mounted augmented reality system is positioned by fusing the measurement data acquired by the vision sensor and the inertial measurement unit; the fusion positioning comprises measurement preprocessing, visual and inertial data fusion, relocation, and global pose graph optimization.
The measurement preprocessing comprises visual front-end processing and IMU data pre-integration.
In the visual front-end processing, the feature points of each frame image are tracked by the LK sparse optical flow algorithm while corner feature points of each frame image are detected; 300 features are maintained for each frame image, and the features in each frame image are kept uniformly distributed by enforcing a minimum pixel distance between two adjacent features.
The corner feature points are detected using the second-derivative matrix of the image gray intensity, which is expressed as
H(p) = [ ∂²I/∂x²    ∂²I/∂x∂y ]
       [ ∂²I/∂x∂y   ∂²I/∂y²  ],
where p denotes a pixel in the image and I denotes the gray intensity at pixel p. A second-derivative image is obtained by computing the second derivatives of the gray intensity at all pixels of a frame image, and the autocorrelation matrix of the second-derivative image is computed in a fixed window around each pixel:
M(x, y) = Σ_{-K≤i,j≤K} w_{i,j} [ I_x(x+i, y+j)²                I_x(x+i, y+j)·I_y(x+i, y+j) ]
                                [ I_x(x+i, y+j)·I_y(x+i, y+j)   I_y(x+i, y+j)²              ],
where w_{i,j} is the normalized weight of the pixel in the i-th row and j-th column of the fixed window, used to produce a circular window with Gaussian weighting, K is the number of pixels per row or column of the fixed window, and I_x(x, y) and I_y(x, y) are the gray-intensity derivatives used to compute the correlation at the x-th row and y-th column of the surrounding window. The eigenvalues of the autocorrelation matrix are computed at each pixel, and if the minimum eigenvalue is still greater than the set minimum threshold, the pixel is taken as a corner feature point.
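For illustration (not part of the patent), the following sketch selects corner features by the minimum eigenvalue of the local gradient autocorrelation matrix, the criterion described above, using OpenCV's goodFeaturesToTrack with the Shi-Tomasi score; the 300-feature budget and the minimum pixel distance follow the front-end description, while the quality level is an assumed value.

```python
import cv2
import numpy as np

def detect_corners(gray, max_features=300, min_distance=30):
    """Detect up to `max_features` corner points, enforcing spatial separation.

    gray         : single-channel uint8 image.
    min_distance : minimum pixel distance between two accepted features, which
                   keeps the features roughly uniformly distributed.
    """
    corners = cv2.goodFeaturesToTrack(
        gray,
        maxCorners=max_features,
        qualityLevel=0.01,        # relative threshold on the minimum eigenvalue
        minDistance=min_distance,
        useHarrisDetector=False,  # False -> Shi-Tomasi minimum-eigenvalue score
    )
    return np.squeeze(corners, axis=1) if corners is not None else np.empty((0, 2))
```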
Tracking the characteristic points of each frame of image through an LK sparse optical flow algorithm, and selecting a key frame image by adopting an average parallax standard or a tracking quality standard; in the average parallax criterion, if the average parallax of the tracking features between the current frame image and the latest key frame image exceeds 10 °, the current frame image is taken as a new key frame image, and the image rotation is compensated by using the short-term integral value of the measurement result of the inertial measurement device in calculating the average parallax. In the tracking quality criterion, if the number of tracking features is below a certain set value, the current frame image is taken as a new key frame image.
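The sketch below (illustrative, not from the patent) tracks the previous frame's features with pyramidal LK optical flow and applies the two key-frame criteria described above; the pixel-parallax threshold is an assumed stand-in for the angular threshold, and the minimum track count is likewise assumed.

```python
import cv2
import numpy as np

def track_and_select_keyframe(prev_gray, cur_gray, prev_pts,
                              parallax_thresh_px=15.0, min_tracked=50):
    """Track features with pyramidal LK optical flow and decide whether the
    current frame becomes a key frame (average-parallax or tracking-quality test)."""
    p0 = prev_pts.reshape(-1, 1, 2).astype(np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, p0, None)
    ok = status.ravel() == 1
    prev_ok = p0.reshape(-1, 2)[ok]
    cur_ok = p1.reshape(-1, 2)[ok]

    # Average parallax of the surviving tracks between the two frames.
    parallax = float(np.linalg.norm(cur_ok - prev_ok, axis=1).mean()) if len(cur_ok) else 0.0

    # Key-frame criteria: large average parallax, or too few tracked features.
    is_keyframe = parallax > parallax_thresh_px or len(cur_ok) < min_tracked
    return cur_ok, is_keyframe
```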
The IMU data pre-integration is characterized in that an IMU integration model is utilized, an IMU pre-integration algorithm is adopted, and an observed value of inertia measurement data, a covariance matrix of residual errors and a Jacobian matrix of the residual errors are calculated, wherein the observed value is used for calculating the residual errors, the Jacobian matrix of the residual errors is a gradient direction for optimizing data fusion, and the covariance matrix is a weight value corresponding to the observed value and is used for fusing the inertia measurement data and visual data.
In the IMU integration model, the inertial measurements are obtained directly in the body coordinate system, so the measured values contain gravity information together with the actual inertial motion of the head-mounted augmented reality system, and are affected by the acceleration bias b_a, the gyroscope bias b_ω and additive noise.
Combining the image frames bk and bk+1, which correspond to two time instants, the position, velocity and orientation of the head-mounted augmented reality system in the world coordinate system over the time interval [tk, tk+1] are computed from the inertial measurements.
After changing the reference coordinate system from the world coordinate system to the body coordinate system, the IMU pre-integration algorithm pre-integrates only the variables related to the linear acceleration â and the angular velocity ω̂ expressed in the body frame.
In the visual and inertial data fusion process, for two consecutive frame images bk and bk+1 in the sliding window, the residual of the pre-integrated inertial measurements is computed according to the IMU measurement model from the noisy accelerometer and gyroscope measurements in the time interval between the two consecutive frame images.
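For illustration (not part of the patent text), a minimal Euler-integration sketch of IMU pre-integration between two frames is given below: it accumulates the relative position, velocity and rotation increments in the body frame of the first frame, with assumed bias values subtracted from the raw measurements.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def preintegrate(acc, gyro, dt, b_a=np.zeros(3), b_w=np.zeros(3)):
    """Pre-integrate IMU samples between frames b_k and b_{k+1}.

    acc, gyro : arrays of shape (N, 3) with raw accelerometer / gyroscope samples.
    dt        : sampling interval in seconds.
    b_a, b_w  : accelerometer and gyroscope biases to subtract.
    Returns (alpha, beta, gamma): position, velocity and rotation increments
    expressed in the body frame at time k (independent of the world-frame state).
    """
    alpha = np.zeros(3)          # position increment
    beta = np.zeros(3)           # velocity increment
    gamma = R.identity()         # rotation increment

    for a_m, w_m in zip(acc, gyro):
        a = gamma.apply(a_m - b_a)          # rotate body acceleration into frame b_k
        alpha = alpha + beta * dt + 0.5 * a * dt**2
        beta = beta + a * dt
        gamma = gamma * R.from_rotvec((w_m - b_w) * dt)

    return alpha, beta, gamma
```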
In the visual and inertial data fusion process, a marginalization method is used to reduce the computational complexity of the nonlinear-optimization-based VIO: selected inertial state quantities x_k and feature quantities λ_l are marginalized out of the sliding window, while the quantities corresponding to the marginalized states are converted into prior data, so that spatially separated key frames remain in the sliding window and provide sufficient parallax for feature triangulation. When the next new frame image is a key frame image, it is retained in the sliding window and the measurements corresponding to the earliest frame image are marginalized. If the next new frame image is a non-key frame image, only its inertial measurements are retained and its visual measurements are discarded, the inertial measurements remaining associated with this non-key frame. The marginalization is computed with the Schur complement, and a new prior is constructed from all marginalized measurements related to the removed states and added to the existing prior.
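A small numeric sketch (not from the patent) of marginalization via the Schur complement: given the normal-equation blocks for the states to be marginalized (m) and the states to keep (r), the prior passed on to the remaining states is the Schur complement of the marginalized block.

```python
import numpy as np

def marginalize(H, b, m_idx, r_idx):
    """Marginalize states `m_idx` out of the linear system H * dx = b.

    H, b  : Hessian (J^T J) and gradient (J^T r) of the full problem.
    Returns (H_prior, b_prior) acting only on the remaining states `r_idx`.
    """
    Hmm = H[np.ix_(m_idx, m_idx)]
    Hmr = H[np.ix_(m_idx, r_idx)]
    Hrm = H[np.ix_(r_idx, m_idx)]
    Hrr = H[np.ix_(r_idx, r_idx)]
    bm, br = b[m_idx], b[r_idx]

    Hmm_inv = np.linalg.inv(Hmm)              # small block, assumed well conditioned
    H_prior = Hrr - Hrm @ Hmm_inv @ Hmr       # Schur complement
    b_prior = br - Hrm @ Hmm_inv @ bm
    return H_prior, b_prior
```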
In the relocation process, loop detection is realized with a bag-of-words model. The bag-of-words model is first built from feature point data actually collected in the space operating environment; when the position of the vision sensor changes, a new feature point set is computed in real time, this feature point set is formed into a new description vector and matched against the key-frame feature description vectors, and if the match succeeds a loop is detected. The bag-of-words model is used to recognize previously visited places, a feature-level connection is then established between the loop-closure candidate and the current frame image, and the feature pairs are integrated into the nonlinear-optimization-based visual-inertial odometry (VIO) to obtain a drift-free target state estimate. By matching the feature point set obtained at a certain position against the bag of words, the features at that position are described as
A = a1 w1 + a2 w2 + a3 w3 + ... + an wn,
where wi denotes the i-th feature (word) vector in the bag of words and ai denotes the frequency with which that feature occurs; [a1, a2, ..., an] is the description vector. The L1 norm is used to compare the distance between description vectors, i.e.
d(a, b) = Σ_{i=1}^{W} | a_i − b_i |,
where a and b are the description vectors being compared and W is the dimension of the description vectors.
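An illustrative sketch (not from the patent) of building a description vector from word occurrences and comparing two images with the L1 distance described above; the vocabulary size and word indices in the example are assumed values.

```python
import numpy as np

def describe(word_ids, vocab_size):
    """Build a bag-of-words description vector [a1, ..., aW] from the vocabulary
    word indices assigned to an image's features (term frequencies)."""
    a = np.zeros(vocab_size)
    for w in word_ids:
        a[w] += 1.0
    return a / max(len(word_ids), 1)

def l1_distance(a, b):
    """L1 distance between two description vectors of dimension W."""
    return float(np.sum(np.abs(a - b)))

# Example: two images whose features fall on a 1000-word vocabulary (assumed size).
img1 = describe([3, 3, 17, 250, 999], vocab_size=1000)
img2 = describe([3, 17, 18, 250, 998], vocab_size=1000)
print(l1_distance(img1, img2))   # small distance -> loop-closure candidate
```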
The bag-of-words model is built with a clustering method, and the bag of words is represented as a k-ary tree. All features collected in a preliminary test provide the feature point data, from which a number of feature point descriptors are randomly selected as clustering samples. The clustering is implemented with the K-Means algorithm: the number of clusters K is fixed, and K D-dimensional feature descriptors are randomly selected as initial cluster centers, where D is the dimension of the feature descriptors; the distance from each feature descriptor to every cluster center is computed and the nearest cluster center is chosen as that descriptor's class; within each class the cluster center is then recomputed, and the assignment and update steps are iterated until the sample data converge. Loop detection is implemented with the TF-IDF algorithm.
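The following is a compact K-Means sketch (illustrative, not from the patent) of the assign-and-update loop used to build one level of the vocabulary tree; the descriptor type and the iteration cap are assumptions.

```python
import numpy as np

def kmeans(samples, k, iters=20, seed=0):
    """Cluster D-dimensional feature descriptors into k words (one tree level).

    samples : array of shape (N, D), e.g. binary descriptors cast to float.
    Returns the (k, D) cluster centers.
    """
    rng = np.random.default_rng(seed)
    centers = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its members.
        new_centers = np.array([samples[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break                      # converged
        centers = new_centers
    return centers
```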
When a loop is detected, the BRIEF feature descriptors are matched, feature consistency is checked, and a connection between the local sliding window and the loop candidate is established. Outliers are removed with geometric outlier rejection: a fundamental matrix test is performed with the RANSAC algorithm using the 2D features retrieved in the current image and in the candidate loop image, and PnP position verification is performed with the RANSAC algorithm using the known 3D positions of the features in the local sliding window and their 2D projections in the candidate loop image.
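An illustrative sketch (not from the patent) of the two RANSAC checks described above, using OpenCV's fundamental-matrix estimation and PnP solver; the thresholds and the inlier count are assumed values.

```python
import cv2
import numpy as np

def verify_loop(pts_cur_2d, pts_loop_2d, pts_window_3d, pts_loop_2d_for_pnp, K):
    """Geometric verification of a loop candidate.

    pts_cur_2d, pts_loop_2d : matched 2D points in the current and loop images.
    pts_window_3d           : 3D positions of features known in the local window.
    pts_loop_2d_for_pnp     : their 2D projections in the candidate loop image.
    K                       : camera intrinsic matrix.
    """
    # 2D-2D check: fundamental matrix with RANSAC.
    F, inl_f = cv2.findFundamentalMat(pts_cur_2d, pts_loop_2d,
                                      cv2.FM_RANSAC, 1.0, 0.99)
    # 3D-2D check: PnP with RANSAC.
    ok, rvec, tvec, inl_pnp = cv2.solvePnPRansac(
        pts_window_3d.astype(np.float32),
        pts_loop_2d_for_pnp.astype(np.float32),
        K, None, reprojectionError=3.0)
    enough = inl_f is not None and inl_pnp is not None and len(inl_pnp) > 25
    return ok and enough, rvec, tvec
```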
In the global pose graph optimization, after relocation the local sliding window is shifted and aligned with past poses. Using the relocation result, registration of the set of past poses into a globally consistent configuration is ensured by global pose graph optimization. When a key frame is marginalized out of the sliding window, it is added to the pose graph as a new vertex, and this new vertex is connected to the other vertices in the pose graph by sequential edges and/or loop edges: the newly marginalized key frame establishes several sequential edges associating it with existing key frames, where a sequential edge represents the relative transformation between two key frames in the local sliding window and its value is taken directly from the VIO.
If the newly marginalized key frame has a loop connection with an existing key frame, a loop edge in the pose graph connects it to the loop frame; the loop edge contains only a 4-degree-of-freedom relative pose transformation, and its value is obtained from the relocation result.
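For illustration (not part of the patent), the sketch below evaluates the residual of one 4-degree-of-freedom edge (relative x, y, z and yaw) between two pose-graph vertices, which is the kind of constraint a sequential or loop edge described above would contribute to the graph optimizer.

```python
import numpy as np

def yaw_rotation(yaw):
    """Rotation matrix about the gravity axis for a given yaw angle (radians)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def edge_residual(p_i, yaw_i, p_j, yaw_j, rel_p_meas, rel_yaw_meas):
    """Residual of a 4-DoF edge between vertices i and j.

    p_i, p_j     : vertex positions (3-vectors) in the world frame.
    yaw_i, yaw_j : vertex yaw angles.
    rel_p_meas   : measured relative translation of j expressed in frame i
                   (from the VIO for a sequential edge, from relocation for a loop edge).
    rel_yaw_meas : measured relative yaw between i and j.
    """
    rel_p_est = yaw_rotation(yaw_i).T @ (np.asarray(p_j) - np.asarray(p_i))
    r_p = rel_p_est - np.asarray(rel_p_meas)
    r_yaw = (yaw_j - yaw_i) - rel_yaw_meas
    # Wrap the yaw residual into (-pi, pi].
    r_yaw = (r_yaw + np.pi) % (2.0 * np.pi) - np.pi
    return np.append(r_p, r_yaw)          # 4-vector fed to the graph optimizer
```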
The beneficial effects of the invention include: the invention can carry out accurate pose estimation on the head-mounted augmented reality system, obtain the pose data thereof, effectively support the training process of the operating equipment of the astronaut and provide a basis for the evaluation of the training effect thereof. The invention can effectively improve the intelligence and the usability of the augmented reality operation assistance of the astronaut and effectively improve the training efficiency of the astronaut.
Drawings
FIG. 1 is a flow chart of an implementation of the method of the present invention;
FIG. 2 is a schematic diagram of the visual inertial alignment process for estimator initialization of the present invention;
FIG. 3 is a schematic view of the epipolar geometry constraint of the present invention;
FIG. 4 is a schematic diagram of the gravity 2 degree-of-freedom parameterization of the present invention;
FIG. 5 is a schematic diagram of relative positions of coordinate systems before and after movement according to the present invention;
FIG. 6 is a schematic view of two movements of the vision sensor and inertial measurement unit platform of the present invention;
FIG. 7 is a graph of visual residuals on a unit sphere according to the present invention;
FIG. 8 is a schematic diagram of the marginalization strategy of the present invention;
FIG. 9 is a schematic view of a sliding window monocular VIO relocation of the present invention;
FIG. 10 is a schematic view of a 5-level 10-pronged tree descriptor bag of the present invention;
FIG. 11 is a schematic diagram of the repositioning and pose graph optimization process of the present invention.
Detailed Description
For a better understanding of the present disclosure, two examples are given herein.
FIG. 1 is a flow chart of an implementation of the method of the present invention; FIG. 2 is a schematic diagram of the visual inertial alignment process for estimator initialization of the present invention; FIG. 3 is a schematic view of the epipolar geometry constraint of the present invention; FIG. 4 is a schematic diagram of the gravity 2 degree-of-freedom parameterization of the present invention; FIG. 5 is a schematic diagram of relative positions of coordinate systems before and after movement according to the present invention; FIG. 6 is a schematic view of two movements of the vision sensor and inertial measurement unit platform of the present invention; FIG. 7 is a graph of the visual residual on a unit sphere of the present invention; FIG. 8 is a schematic diagram of the marginalization strategy of the present invention; FIG. 9 is a schematic view of a sliding window monocular VIO relocation of the present invention; FIG. 10 is a schematic view of a 5-level 10-pronged tree descriptor bag of the present invention; FIG. 11 is a schematic diagram of the repositioning and pose graph optimization process of the present invention.
The first embodiment is as follows:
the invention discloses a visual and inertial mixed pose tracking method of a head-mounted augmented reality system, which comprises the steps of firstly carrying out equipment initialization, including measurement preprocessing on a visual sensor and an inertial measurement device, carrying out combined calibration on the visual sensor and the inertial measurement device, then carrying out fusion positioning on the head-mounted augmented reality system by using measurement data acquired by the visual sensor and the inertial measurement device, and carrying out loop detection and global pose graph optimization on a positioning result to obtain a final pose estimation result.
The measurement preprocessing process comprises processing frame images acquired by a visual sensor and simultaneously performing pre-integration on inertial measurement results corresponding to two continuous frame images;
the method comprises the steps of carrying out combined calibration on a vision sensor and an inertia measurement device, calibrating the inertia measurement device to obtain an initial estimation result of a target attitude, a speed, a gravity vector, a gyroscope bias and a three-dimensional characteristic position, guiding a Visual-Inertial odometer VIO (VIO) based on nonlinear optimization, and calculating an absolute scale factor, the gyroscope bias, an acceleration bias, a gravity acceleration and a relative speed of the inertia measurement device to the vision sensor at each measurement moment.
The fusion positioning process fuses the pre-integrated inertial measurements, the visual image feature observations and the loop detection results to realize fusion positioning of the head-mounted augmented reality system.
The global pose graph optimization performs global optimization using the relocation result in order to eliminate drift.
In the equipment initialization process, firstly, a visual SFM algorithm based on a sliding window is initialized, then the pose change of the head-mounted augmented reality system is calculated, and the IMU pre-integration result is matched with the visual SFM algorithm result to recover the scale, gravity, speed and deviation so as to guide the VIO based on nonlinear optimization.
The initialization of the sliding-window-based visual SFM algorithm includes feature point tracking, camera pose estimation and optimization. For feature point tracking, the feature correspondences between the latest frame image acquired by the head-mounted augmented reality system and all frame images in the sliding window are checked. If the latest frame image and any frame image in the sliding window share more than a certain number of tracked features and enough rotation-compensated parallax (in pixels), the relative rotation and scaled translation of the latest frame image are computed from those tracked features and all features in the two frames are triangulated; otherwise the latest frame image is stored in the sliding window and a new frame image is awaited. For camera pose estimation, the poses of all frame images in the sliding window are estimated from the triangulated features with the PnP method, giving the camera poses. The optimization is realized by minimizing the total reprojection error of all feature observations with a BA (bundle adjustment) optimization method, completing the optimized estimate of the camera attitude.
Let P be the position in real space of the real object corresponding to the feature points of the left and right frame images acquired by the left-eye and right-eye cameras respectively, called the feature position, and let the feature points p1 and p2 be the projections of that object onto the left-eye and right-eye cameras, whose centers are O1 and O2. Writing P = [X, Y, Z]^T, the pinhole camera model gives for the feature points p1, p2 the relations:
s1 p1 = K1 P
s2 p2 = K2 (R P + t),
where K1 and K2 are the intrinsic matrices of the left-eye and right-eye cameras, R and t are the rotation matrix and translation vector between the left-eye and right-eye cameras, and s1, s2 are the scale factors of the left-eye and right-eye cameras. Stacking these projection constraints gives a linear system in P whose least-squares solution yields the feature position estimate.
From the feature point matching information and the feature position estimates, the SFM algorithm obtains the three-dimensional spatial position of each current feature point and its projected position in the camera, and the image pose is then estimated directly with the PnP algorithm.
The calibration of the inertial measurement unit includes calibrating the gyroscope bias, refining the gravity vector, and calibrating the velocity, gravity vector and scale. For the gyroscope bias calibration, consider two consecutive frame images bk and bk+1 in the image window, for which the visual SFM algorithm gives the corresponding rotations q̂^c0_bk and q̂^c0_bk+1; denote the gyroscope bias correction between the time of frame image bk+1 and the time of frame image bk by δbω, and denote the pre-integrated rotation of the corresponding inertial measurements by γ̂^bk_bk+1. Linearizing the pre-integration term with respect to the gyroscope bias of the inertial measurement unit, the following minimization loss function is constructed:
min over δbω of Σ_{k∈B} ‖ (q̂^c0_bk+1)^(-1) ⊗ q̂^c0_bk ⊗ γ̂^bk_bk+1 ‖²,
γ̂^bk_bk+1 ≈ γ̂^bk_bk+1 ⊗ [1, 1/2·J^γ_bω·δbω]^T,
where B is the set of all frame image sequence numbers in the image window and J^γ_bω is the bias Jacobian of the pre-integrated rotation with respect to the gyroscope bias. Using this bias Jacobian to form a first-order approximation in the gyroscope bias, solving the minimization loss function yields the initial calibration of the gyroscope bias bω.
After the gyroscope bias is initialized, the velocity, the gravity vector and the actual scale are computed; the quantities to be calculated are collected in the vector
X_I = [v_b0, v_b1, …, v_bn, g^c0, s]^T,
where v_bk is the velocity of the target expressed in the body coordinate system at the time the k-th frame image is acquired, g^c0 is the gravity vector expressed in the frame of the c0-th frame image, and s is the scale factor that aligns the visual SFM result to metric units.
For the joint calibration of the vision sensor and the inertial measurement unit, let C_obj denote the object coordinate system, which is taken as the world coordinate system; C_c the vision sensor coordinate system; and C_i the inertial measurement unit coordinate system. The relative pose between C_c and C_obj can be obtained by a vision sensor calibration method; here R_a and t_a denote the attitude and the three translation components of the vision sensor relative to the world coordinate system. Through the relative position t_x between C_i and C_c, the relative pose R_x between C_i and C_obj is obtained. The vision sensor and the inertial measurement unit are calibrated by observing a known calibration reference in space from different positions, and the relationship between R_x and t_x is derived from the results of several observations.
The joint calibration of the vision sensor and the inertial measurement unit requires multi-dimensional parameter information from the inertial measurement unit, obtained as follows. First, the 3D information is read directly: the set of target attitude angles represented by three Euler angles, together with the raw angular velocity and linear acceleration, is obtained directly from the inertial measurement unit, giving the required target attitude at each measurement time. Then, assuming the target initial position is zero, i.e. the initial position s0 = 0, the target position s1 reached after a time Δt at initial velocity v is estimated by the linear formula s1 = vΔt + 1/2·aΔt², where a is the target acceleration. The multi-dimensional parameter information thus comprises the target attitude angles, the raw angular velocity, the linear acceleration, the target attitude over time, and the target position.
The head-mounted augmented reality system is positioned by fusing the measurement data acquired by the vision sensor and the inertial measurement unit; the fusion positioning comprises measurement preprocessing, visual and inertial data fusion, relocation, and global pose graph optimization.
The measurement preprocessing comprises visual front-end processing and IMU data pre-integration.
In the visual front-end processing, the feature points of each frame image are tracked by the LK sparse optical flow algorithm while corner feature points of each frame image are detected; 300 features are maintained for each frame image, and the features in each frame image are kept uniformly distributed by enforcing a minimum pixel distance between two adjacent features.
The corner feature points are detected using the second-derivative matrix of the image gray intensity, which is expressed as
H(p) = [ ∂²I/∂x²    ∂²I/∂x∂y ]
       [ ∂²I/∂x∂y   ∂²I/∂y²  ],
where p denotes a pixel in the image and I denotes the gray intensity at pixel p. A second-derivative image is obtained by computing the second derivatives of the gray intensity at all pixels of a frame image, and the autocorrelation matrix of the second-derivative image is computed in a fixed window around each pixel:
M(x, y) = Σ_{-K≤i,j≤K} w_{i,j} [ I_x(x+i, y+j)²                I_x(x+i, y+j)·I_y(x+i, y+j) ]
                                [ I_x(x+i, y+j)·I_y(x+i, y+j)   I_y(x+i, y+j)²              ],
where w_{i,j} is the normalized weight of the pixel in the i-th row and j-th column of the fixed window, used to produce a circular window with Gaussian weighting, K is the number of pixels per row or column of the fixed window, and I_x(x, y) and I_y(x, y) are the gray-intensity derivatives used to compute the correlation at the x-th row and y-th column of the surrounding window. The eigenvalues of the autocorrelation matrix are computed at each pixel, and if the minimum eigenvalue is still greater than the set minimum threshold, the pixel is taken as a corner feature point.
Tracking the characteristic points of each frame of image through an LK sparse optical flow algorithm, and selecting a key frame image by adopting an average parallax standard or a tracking quality standard; in the average parallax criterion, if the average parallax of the tracking features between the current frame image and the latest key frame image exceeds 10 °, the current frame image is taken as a new key frame image, and the image rotation is compensated by using the short-term integral value of the measurement result of the inertial measurement device in calculating the average parallax. In the tracking quality criterion, if the number of tracking features is below a certain set value, the current frame image is taken as a new key frame image.
The IMU data pre-integration is characterized in that an IMU integration model is utilized, an IMU pre-integration algorithm is adopted, and an observed value of inertia measurement data, a covariance matrix of residual errors and a Jacobian matrix of the residual errors are calculated, wherein the observed value is used for calculating the residual errors, the Jacobian matrix of the residual errors is a gradient direction for optimizing data fusion, and the covariance matrix is a weight value corresponding to the observed value and is used for fusing the inertia measurement data and visual data.
In the IMU integration model, the inertial measurements are obtained directly in the body coordinate system, so the measured values contain gravity information together with the actual inertial motion of the head-mounted augmented reality system, and are affected by the acceleration bias b_a, the gyroscope bias b_ω and additive noise.
Combining the image frames bk and bk+1, which correspond to two time instants, the position, velocity and orientation of the head-mounted augmented reality system in the world coordinate system over the time interval [tk, tk+1] are computed from the inertial measurements.
After changing the reference coordinate system from the world coordinate system to the body coordinate system, the IMU pre-integration algorithm pre-integrates only the variables related to the linear acceleration â and the angular velocity ω̂ expressed in the body frame.
In the visual and inertial data fusion process, for two consecutive frame images bk and bk+1 in the sliding window, the residual of the pre-integrated inertial measurements is computed according to the IMU measurement model from the noisy accelerometer and gyroscope measurements in the time interval between the two consecutive frame images.
In the visual and inertial data fusion process, a marginalization method is used to reduce the computational complexity of the nonlinear-optimization-based VIO: selected inertial state quantities x_k and feature quantities λ_l are marginalized out of the sliding window, while the quantities corresponding to the marginalized states are converted into prior data, so that spatially separated key frames remain in the sliding window and provide sufficient parallax for feature triangulation. When the next new frame image is a key frame image, it is retained in the sliding window and the measurements corresponding to the earliest frame image are marginalized. If the next new frame image is a non-key frame image, only its inertial measurements are retained and its visual measurements are discarded, the inertial measurements remaining associated with this non-key frame. The marginalization is computed with the Schur complement, and a new prior is constructed from all marginalized measurements related to the removed states and added to the existing prior.
In the relocation process, loop detection is realized with a bag-of-words model. The bag-of-words model is first built from feature point data actually collected in the space operating environment; when the position of the vision sensor changes, a new feature point set is computed in real time, this feature point set is formed into a new description vector and matched against the key-frame feature description vectors, and if the match succeeds a loop is detected. The bag-of-words model is used to recognize previously visited places, a feature-level connection is then established between the loop-closure candidate and the current frame image, and the feature pairs are integrated into the nonlinear-optimization-based visual-inertial odometry (VIO) to obtain a drift-free target state estimate. By matching the feature point set obtained at a certain position against the bag of words, the features at that position are described as
A = a1 w1 + a2 w2 + a3 w3 + ... + an wn,
where wi denotes the i-th feature (word) vector in the bag of words and ai denotes the frequency with which that feature occurs; [a1, a2, ..., an] is the description vector. The L1 norm is used to compare the distance between description vectors, i.e.
d(a, b) = Σ_{i=1}^{W} | a_i − b_i |,
where a and b are the description vectors being compared and W is the dimension of the description vectors.
The bag-of-words model is built with a clustering method, and the bag of words is represented as a k-ary tree. All features collected in a preliminary test provide the feature point data, from which a number of feature point descriptors are randomly selected as clustering samples. The clustering is implemented with the K-Means algorithm: the number of clusters K is fixed, and K D-dimensional feature descriptors are randomly selected as initial cluster centers, where D is the dimension of the feature descriptors; the distance from each feature descriptor to every cluster center is computed and the nearest cluster center is chosen as that descriptor's class; within each class the cluster center is then recomputed, and the assignment and update steps are iterated until the sample data converge. Loop detection is implemented with the TF-IDF algorithm.
When a loop is detected, the BRIEF feature descriptors are matched, feature consistency is checked, and a connection between the local sliding window and the loop candidate is established. Outliers are removed with geometric outlier rejection: a fundamental matrix test is performed with the RANSAC algorithm using the 2D features retrieved in the current image and in the candidate loop image, and PnP position verification is performed with the RANSAC algorithm using the known 3D positions of the features in the local sliding window and their 2D projections in the candidate loop image.
In the global pose graph optimization, after relocation the local sliding window is shifted and aligned with past poses. Using the relocation result, registration of the set of past poses into a globally consistent configuration is ensured by global pose graph optimization. When a key frame is marginalized out of the sliding window, it is added to the pose graph as a new vertex, and this new vertex is connected to the other vertices in the pose graph by sequential edges and/or loop edges: the newly marginalized key frame establishes several sequential edges associating it with existing key frames, where a sequential edge represents the relative transformation between two key frames in the local sliding window and its value is taken directly from the VIO.
If the newly marginalized key frame has a loop connection with an existing key frame, a loop edge in the pose graph connects it to the loop frame; the loop edge contains only a 4-degree-of-freedom relative pose transformation, and its value is obtained from the relocation result.
Example two:
the invention discloses a visual and inertial mixed pose tracking method of a head-mounted augmented reality system, which comprises the steps of firstly carrying out equipment initialization, including measurement preprocessing on a visual sensor and an inertial measurement device, carrying out combined calibration on the visual sensor and the inertial measurement device, then carrying out fusion positioning on the head-mounted augmented reality system by using measurement data acquired by the visual sensor and the inertial measurement device, and carrying out loop detection and global pose graph optimization on a positioning result to obtain a final pose estimation result.
The measurement preprocessing process comprises processing frame images acquired by a vision sensor and performing pre-integration on inertial measurement results corresponding to two continuous frame images;
the method comprises the steps of carrying out combined calibration on a vision sensor and an inertia measurement device, calibrating the inertia measurement device to obtain an initial estimation result of a target attitude, a speed, a gravity vector, a gyroscope bias and a three-dimensional characteristic position, guiding a Visual-Inertial odometer VIO (VIO) based on nonlinear optimization, and calculating an absolute scale factor, the gyroscope bias, an acceleration bias, a gravity acceleration and a relative speed of the inertia measurement device to the vision sensor at each measurement moment.
The fusion positioning process fuses the pre-integrated inertial measurements, the visual image feature observations and the loop detection results to realize fusion positioning of the head-mounted augmented reality system.
The global pose graph optimization performs global optimization using the relocation result in order to eliminate drift.
Monocular visual-inertial odometry is a highly nonlinear system, and scale information is difficult to observe directly with a monocular camera, so good initial values are required before the inertial and visual measurements can be fused directly. At the same time, the inertial system suffers from bias, requires the gravitational acceleration when solving for pose, and requires accurate velocity information during rapid motion, so the initialization process must compute the absolute scale factor between the IMU and the vision measurements, the gyroscope bias, the acceleration bias, the gravitational acceleration and the velocity at each IMU measurement time. Because in practical applications the system must be initialized while in motion, an online calibration scheme is needed to ensure stable operation in a variety of environments.
Visual-only SLAM, or Structure from Motion (SfM), usually has good initialization properties, i.e. the system is bootstrapped by deriving initial values from relative motion.
In the device initialization process, the visual SFM algorithm based on the sliding window is initialized, the pose change of the head-mounted augmented reality system is calculated, the IMU pre-integration result is matched with the visual SFM algorithm result, and the scale, gravity, speed and deviation are restored to guide the VIO based on the nonlinear optimization, as shown in FIG. 2.
In the initialization of the sliding-window-based visual SFM algorithm, visual SfM can automatically recover the camera motion and the scene structure from several images of the scene; the sliding-window-based visual SfM initialization process estimates the motion with visual SfM while using the sliding-window technique to control the computational complexity.
The initialization of the sliding-window-based visual SFM algorithm includes feature point tracking, camera pose estimation and optimization. For feature point tracking, the feature correspondences between the latest frame image acquired by the head-mounted augmented reality system and all frame images in the sliding window are first checked. If the latest frame image and any frame image in the sliding window share more than a certain number of tracked features (more than 30 tracked features) and enough rotation-compensated parallax (more than 20 pixels), the five-point algorithm is used to compute the relative rotation and scaled translation of the latest frame image from those tracked features, and all features in the two frames are triangulated; otherwise the latest frame image is stored in the sliding window and a new frame image is awaited. For camera pose estimation, the poses of all frame images in the sliding window are estimated from the triangulated features with the PnP method, giving the camera poses. The optimization is realized by minimizing the total reprojection error of all feature observations with a BA optimization method, completing the optimized estimate of the camera attitude. Feature point tracking has been described above; the camera pose estimation is solved with an epipolar geometry approach, and the optimization is set forth in the global pose graph optimization.
FIG. 3 is a schematic view of the epipolar geometric constraint. P is the position of the real-space object point corresponding to the feature points of the left and right frames acquired by the left-eye and right-eye cameras, and is called the feature position; p1 and p2 are the projections of the object point onto the left-eye and right-eye cameras, and O1 and O2 are the optical centers of the left-eye and right-eye cameras. Since the relative placement of the two cameras of the binocular rig is accurately measured and known, the depth of P can be estimated from p1 and p2. Writing P = [X, Y, Z]^T and using the pinhole camera model, the feature points p1 and p2 satisfy:

s1 p1 = K1 P

s2 p2 = K2 (R P + t),

where K1 and K2 are the intrinsic matrices of the left-eye and right-eye cameras, R and t are the rigid-body rotation matrix and translation (offset) between the two cameras, and s1, s2 are the projective scale factors of the left-eye and right-eye cameras. Stacking these projection constraints gives an overdetermined linear system in P whose least-squares solution yields the feature position estimate. From the feature point matching information and the feature position estimates, the SfM algorithm provides the three-dimensional positions of the current feature points and their projections onto the camera, and the image pose is then estimated directly with the Perspective-n-Point (PnP) algorithm.
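As an illustrative sketch only (not taken from the patent), the feature-position estimation and the subsequent PnP pose estimation described above can be prototyped with OpenCV; the intrinsics K1, K2, the relative pose R, t and the matched pixel arrays are assumed to be available from the calibration and tracking steps.

import cv2
import numpy as np

def estimate_feature_and_pose(K1, K2, R, t, pts1, pts2):
    """Triangulate matched features of two views and estimate the pose of the
    second view with PnP. pts1, pts2: Nx2 pixel coordinates of matched points."""
    # Projection matrices built from the known intrinsics and relative pose
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K2 @ np.hstack([R, t.reshape(3, 1)])

    # Linear triangulation (least-squares solution for each feature position P)
    X_h = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    X = (X_h[:3] / X_h[3]).T                     # Nx3 feature positions

    # PnP with RANSAC: 3D feature positions vs. their 2D observations in view 2
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        X.astype(np.float32), pts2.astype(np.float32), K2, None)
    R2, _ = cv2.Rodrigues(rvec)                  # rotation matrix of view 2
    return X, R2, tvec, inliers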
Assume a feature point P with known spatial position is observed at a certain pose, with homogeneous coordinates [X, Y, Z, 1]^T, and that its projection onto the left eye of the binocular vision sensor is x1 = [u1, v1, 1]^T. Define the augmented matrix T = [R | t], a 3x4 matrix containing the rotation and translation information, so that the projection equation is:

s x1 = K [R | t] P

where K is the known camera intrinsic matrix. Since K is known, normalized image-plane coordinates can be used to remove the influence of the intrinsic matrix K. Denoting the three rows of T by t1^T, t2^T and t3^T, the projection equation expands to

s [u1, v1, 1]^T = [t1^T; t2^T; t3^T] P.

Eliminating the scale factor s with the last row yields two constraints:

u1 = (t1^T P) / (t3^T P),  v1 = (t2^T P) / (t3^T P),

that is:

t1^T P - (t3^T P) u1 = 0
t2^T P - (t3^T P) v1 = 0.
Because T contains 12 unknowns, at least 6 pairs of 3D-2D correspondences are required. To estimate the camera pose more accurately, 18 points are selected at random from the currently measured feature points; if fewer than 18 are available, all of them are used, and if fewer than 6 are available the pose cannot be computed and is discarded. When more than 6 points are available, the least-squares solution is obtained by SVD (singular value decomposition). The two constraints of each of the 18 3D-2D correspondences are stacked into the linear system

A t = 0,

where t = [t1, t2, ..., t12]^T collects the 12 unknown entries of T and A is the coefficient matrix formed by stacking two rows per correspondence (36x12 for 18 points). Taking the singular value decomposition

A = U S V^T,

the right singular vector in V corresponding to the smallest singular value in the singular value matrix S is the least-squares solution, i.e., the least-squares solution of the camera pose estimate.
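The DLT solution above can be written as a short numpy sketch (illustrative only, not the patent's code): it stacks the two constraints per 3D-2D correspondence, solves A t = 0 by SVD, and reshapes the result into the 3x4 matrix T = [R | t]; the final scale/sign normalization is an approximation of the usual post-processing.

import numpy as np

def pnp_dlt(points_3d, points_2d_norm):
    """Direct Linear Transform PnP.
    points_3d:      Nx3 points in the world frame (N >= 6).
    points_2d_norm: Nx2 normalized image coordinates (intrinsics removed)."""
    A = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d_norm):
        P = [X, Y, Z, 1.0]
        # t1^T P - (t3^T P) u = 0   and   t2^T P - (t3^T P) v = 0
        A.append(P + [0, 0, 0, 0] + [-u * p for p in P])
        A.append([0, 0, 0, 0] + P + [-v * p for p in P])
    A = np.asarray(A)                      # 2N x 12 coefficient matrix
    _, _, Vt = np.linalg.svd(A)
    T = Vt[-1].reshape(3, 4)               # singular vector of the smallest singular value
    # Normalize so the left 3x3 block is close to a proper rotation (approximate)
    U, S, Vt3 = np.linalg.svd(T[:, :3])
    T = T / np.mean(S) * np.sign(np.linalg.det(U @ Vt3))
    return T                               # 3x4 pose [R | t], up to DLT approximation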
The IMU calibration produces estimates of the gyroscope bias, the velocity, the gravitational acceleration and the scale, and refines the gravity vector, providing accurate initial parameters for the joint visual-inertial calibration.
The calibration of the inertial measurement device comprises gyroscope bias calibration, gravity vector refinement, and the calibration of velocity, gravity vector and scale. For the gyroscope bias, consider two consecutive frames b_k and b_{k+1} in the image window: the visual SfM algorithm provides their rotations q^{c0}_{b_k} and q^{c0}_{b_{k+1}} with respect to the reference frame c0, and the pre-integration of the corresponding inertial measurements provides the relative rotation γ̂^{b_k}_{b_{k+1}} from frame b_{k+1} back to frame b_k. Linearizing the pre-integration term with respect to the gyroscope bias of the inertial measurement device, the following minimization loss function is constructed:

min_{δb_ω} Σ_{k∈B} || (q^{c0}_{b_{k+1}})^{-1} ⊗ q^{c0}_{b_k} ⊗ γ^{b_k}_{b_{k+1}} ||²,

γ^{b_k}_{b_{k+1}} ≈ γ̂^{b_k}_{b_{k+1}} ⊗ [1, ½ J^γ_{b_ω} δb_ω]^T,

where B is the set of all frame indices in the image window and J^γ_{b_ω} is the bias Jacobian of the pre-integrated rotation, used as the calibration coefficient of the first-order approximation with respect to the gyroscope bias. Solving this minimization loss function yields the initial calibration of the gyroscope bias b_ω.
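A compact sketch of this gyroscope-bias calibration step (an assumption-laden illustration, not the patent's code): q_c0_b holds the visual SfM rotations of the window frames, gamma_hat the pre-integrated relative rotations, and J_gamma_bw the Jacobians of the pre-integrated rotation with respect to the gyroscope bias; quaternion sign and composition conventions may differ in a real implementation.

import numpy as np
from scipy.spatial.transform import Rotation as R

def calibrate_gyro_bias(q_c0_b, gamma_hat, J_gamma_bw):
    """Solve the linearized least-squares problem for the gyroscope bias correction.
    q_c0_b:     list of scipy Rotations q^{c0}_{b_k}, k = 0..n  (from visual SfM)
    gamma_hat:  list of scipy Rotations, pre-integrated rotation b_k -> b_{k+1}
    J_gamma_bw: list of 3x3 Jacobians of gamma_hat w.r.t. the gyroscope bias."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for k in range(len(gamma_hat)):
        # residual rotation that should be identity when the bias is correct
        q_err = gamma_hat[k].inv() * q_c0_b[k].inv() * q_c0_b[k + 1]
        r = 2.0 * q_err.as_quat()[:3]          # vector part of the error quaternion
        J = J_gamma_bw[k]
        A += J.T @ J
        b += J.T @ r
    delta_bw = np.linalg.solve(A, b)           # correction added to the gyroscope bias
    return delta_bw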
After the gyroscope bias has been initialized, the velocity, the gravity vector and the actual scale are computed next. The quantities to be estimated are collected in the state vector

X_I = [v^{b_0}_{b_0}, v^{b_1}_{b_1}, ..., v^{b_n}_{b_n}, g^{c0}, s]^T,

where v^{b_k}_{b_k} is the velocity of the target in the body coordinate system at the time the k-th frame is acquired, g^{c0} is the gravity vector expressed in the frame of the c0-th image, and s is the scale factor that aligns the visual SfM result with metric units. Consider two consecutive frames b_k and b_{k+1} in the image window: their pre-integration terms α̂^{b_k}_{b_{k+1}} and β̂^{b_k}_{b_{k+1}} can be written as linear functions of the states in X_I, which are combined into the linear measurement model

ẑ^{b_k}_{b_{k+1}} = H^{b_k}_{b_{k+1}} X_I + n^{b_k}_{b_{k+1}},

where the rotations R^{c0}_{b_k} entering H^{b_k}_{b_{k+1}} are obtained from the scale-matched monocular visual SfM and Δt_k is the time interval between the two consecutive frames. By solving the linear least-squares problem

min_{X_I} Σ_{k∈B} || ẑ^{b_k}_{b_{k+1}} - H^{b_k}_{b_{k+1}} X_I ||²,

the velocity of each frame in the body coordinate system, the gravity vector in the visual reference frame and the scale are obtained from the sliding window.
For the gravity vector refinement, since the magnitude of the gravity vector is known, the gravity vector obtained by the linear initialization step can be refined by constraining its norm. With the magnitude fixed, the gravity vector retains only 2 degrees of freedom and can therefore be re-parameterized with two variables in its tangent space.

The gravity vector refinement first writes the gravity vector as

g^{c0} ≈ g · ḡ̂ + w1 b1 + w2 b2,

where g is the known magnitude of gravity, ḡ̂ is the unit vector giving the direction of gravity, and b1, b2 are two orthogonal bases spanning the tangent plane at ḡ̂, as shown in FIG. 4, with w1 and w2 the displacements along b1 and b2, respectively. A set of b1, b2 can be found through the cross-product construction of Algorithm 1. This parameterized expression is then substituted for g in the linear formulation, and w1 and w2 are solved together with the other state variables; the process is iterated until ĝ converges.

Since the magnitude of the earth's gravity is known, g lies on a sphere of radius g ≈ 9.81 m/s². Gravity is therefore parameterized as g · ḡ̂ + w1 b1 + w2 b2, where b1 and b2 are the two orthogonal bases of the spanned tangent plane.

After the gravity vector is refined, the rotation q^{w}_{c0} between the world coordinate system and the visual sensor coordinate system c0 is obtained by rotating the gravity onto the z axis. All variables are then rotated from the reference frame (·)^{c0} to the world frame (·)^{w}, and the velocities in the body coordinate system are likewise rotated to the world frame.
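The tangent-space parameterization of gravity and the final alignment of the c0 frame with the world frame can be sketched as follows (illustrative assumptions: g_c0 is the linearly initialized gravity vector, the gravity magnitude is 9.81 m/s², and the world vertical axis is taken as +z, which is a sign convention chosen here).

import numpy as np

G_NORM = 9.81  # known magnitude of gravity, m/s^2

def tangent_basis(g_dir):
    """Two orthonormal vectors b1, b2 spanning the tangent plane at g_dir (unit vector)."""
    tmp = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(g_dir, tmp)) > 0.9:          # avoid a nearly parallel helper axis
        tmp = np.array([0.0, 0.0, 1.0])
    b1 = np.cross(g_dir, tmp); b1 /= np.linalg.norm(b1)
    b2 = np.cross(g_dir, b1)
    return b1, b2

def refine_gravity_step(g_c0, w):
    """One update: add the tangent-plane correction, then re-normalize onto the
    sphere of radius G_NORM; iterate with re-estimated (w1, w2) until convergence."""
    g_dir = g_c0 / np.linalg.norm(g_c0)
    b1, b2 = tangent_basis(g_dir)
    g_new = G_NORM * g_dir + w[0] * b1 + w[1] * b2
    return G_NORM * g_new / np.linalg.norm(g_new)

def rotation_c0_to_world(g_c0):
    """Rotation mapping the refined gravity direction onto the world z axis (Rodrigues)."""
    g_dir = g_c0 / np.linalg.norm(g_c0)
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(g_dir, z)
    s, c = np.linalg.norm(axis), np.dot(g_dir, z)
    if s < 1e-8:
        return np.eye(3)
    axis /= s
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    theta = np.arctan2(s, c)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K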
The calibration of the vision sensor and the inertial measurement device determines the 6-DoF information, i.e. the relative attitude angle and position, between the vision sensor and the inertial measurement device after the coordinate systems of the two have been established. Because the raw values in the inertial measurement coordinate system are obtained directly from the inertial measurement device, they must be converted into the data required by the vision sensor during hybrid visual-inertial tracking, and this conversion depends on the relative 6-DoF relationship between the two coordinate systems.
In the joint calibration of the vision sensor and the inertial measurement device, C_obj denotes the object coordinate system, which is taken as the world coordinate system; C_c denotes the vision sensor coordinate system; C_i denotes the inertial measurement device coordinate system. As shown in FIG. 5, the relative pose between C_c and C_obj can be obtained by a vision sensor calibration method; it is denoted here by R_a and t_a, the attitude and the three translational components of the vision sensor with respect to the world coordinate system. Through the relative pose R_x, t_x between C_i and C_c, the relative pose between C_i and C_obj is obtained. Because the vision sensor and the inertial measurement device are rigidly fixed relative to each other, R_x and t_x are constant. The calibration of the vision sensor and the inertial measurement device is carried out by observing a known calibration reference in space from different positions and deriving R_x and t_x from the relations between the repeated observations. FIG. 5 shows the relative positions of the coordinate systems when moving from position P_a to position P_b.
Let the 4x4 matrices A, B, C, D and X each describe the relative orientation between two coordinate systems; a relative orientation with rotation matrix R_a and translation vector t_a is written as

[ R_a  t_a ; 0  1 ],

where the subscript a denotes the name of the matrix. In the figure, C_obj is the coordinate system of the calibration reference (i.e., the world coordinate system referred to earlier), C_c1 and C_i1 are the vision sensor and inertial measurement coordinate systems at time t1, and C_c2 and C_i2 are these two coordinate systems at time t2. The extrinsic parameters of C_c1 and C_c2 relative to C_obj can be determined by a vision sensor calibration method and are denoted A and B, respectively. If C denotes the relative orientation between C_c1 and C_c2 (i.e., the rotation and translation of the vision sensor as it moves from C_c1 to C_c2), then C = A B^{-1}. The relative orientation between C_i1 and C_i2 can be obtained directly from the inertial measurement device and is denoted D (i.e., the rotation and translation of the inertial measurement device as it moves from C_i1 to C_i2). The relative orientation between C_c1 and C_i1 is X; the relative orientation between C_c2 and C_i2 is also X, so X is a fixed value. For an arbitrary point P in space, with coordinates p_c1, p_c2, p_i1, p_i2 in the four coordinate systems C_c1, C_c2, C_i1, C_i2, the following relations hold:

p_c1 = C p_c2
p_c1 = X p_i1
p_i1 = D p_i2
p_c2 = X p_i2

from which:

p_c1 = C X p_i2
p_c1 = X D p_i2,

and hence C X = X D, the basic equation for calibrating the vision sensor and the inertial measurement device, where the relative pose X between the vision sensor and the inertial measurement device is unchanged before and after the motion.
Writing each 4x4 matrix in C X = X D with its rotation matrix and translation vector gives

[ R_c  t_c ; 0  1 ] [ R_x  t_x ; 0  1 ] = [ R_x  t_x ; 0  1 ] [ R_d  t_d ; 0  1 ],

which expands to:

R_c R_x = R_x R_d
R_c t_x + t_c = R_x t_d + t_x.

In these equations R_c, R_d, t_c and t_d are known, and R_c, R_x, R_d are all unit orthogonal (rotation) matrices; the quantities to be solved are R_x and t_x.

From the equation R_c R_x = R_x R_d one obtains R_d = R_x^{-1} R_c R_x. Here R_x is a unit orthogonal matrix defining a motion of any point in space; such a motion can be regarded as a rotation by an angle θ about some axis k, and k and θ are uniquely determined by the R matrix, so the R_x matrix can be written as R_x(k, θ), where k is the unit vector of the rotation axis and the sign of θ follows the right-hand rule. If R_x = R_x(k, θ), then the eigenvalues of R_x are 1, e^{jθ} and e^{-jθ}, and the eigenvector of R_x corresponding to the eigenvalue 1 is k (i.e., the first column of the eigenvector matrix U is k). It follows that R_d and R_c are similar matrices and share the same eigenvalues. Since the eigenvalues of a rotation matrix are uniquely determined by its rotation angle, the rotation angles defined by R_d and R_c must be equal; denoting this rotation angle by θ, the equation can be written as:

R_c(k_c, θ) R_x = R_x R_d(k_d, θ),

where k_c and k_d are the rotation axes defined by R_c and R_d. Note that in practice R_c and R_d are known data; the requirement that they define equal rotation angles is a condition on the equation itself and is independent of the R_x to be determined.
If R_c(k_c, θ) R_x = R_x R_d(k_d, θ), then k_c = R_x k_d; and if R_x* is a particular solution, then every solution R_x of the above equation can be expressed as

R_x = R(k_c, β) R_x*,

where k_c is the rotation axis of R_c and β is an arbitrary angle. R_x therefore cannot be solved uniquely from this equation; one degree of freedom β remains to be determined.

The equation R_c t_x + t_c = R_x t_d + t_x can be written as:

(R_c - I) t_x = R_x t_d - t_c,

where R_c, t_c and t_d are known, I is the identity matrix, and t_x is the parameter to be solved. From the properties of the rotation matrix described above, the eigenvalues of R_c - I are 0, e^{jθ} - 1 and e^{-jθ} - 1. It can be seen that one of the eigenvalues of R_c - I is 0; when θ ≠ 0 (i.e., R_c is not an identity matrix), R_c - I is a matrix of rank 2. Thus, even if R_x is known, this equation provides only two independent linear equations in the three components of t_x, so t_x cannot be uniquely determined from it either; one degree of freedom remains to be determined.
If two motions are performed during the calibration (the coordinate system positions before and after the two motions are shown in FIG. 6), the following four relations are obtained:

R_c1 R_x = R_x R_d1,  R_c1 t_x + t_c1 = R_x t_d1 + t_x,  R_c2 R_x = R_x R_d2,  R_c2 t_x + t_c2 = R_x t_d2 + t_x,

where R_c1, t_c1, R_c2, t_c2, the parameters of the two motions, are given by the extrinsic parameters obtained from three vision sensor calibrations, and R_d1, t_d1, R_d2, t_d2 are given by the inertial measurement device for the two motions. R_x is determined jointly from these equations and is then substituted back to solve for t_x. Any solution R_x must satisfy the following two equations simultaneously:

k_c1 = R_x k_d1
k_c2 = R_x k_d2,

where k_c1, k_c2, k_d1, k_d2 are the unit vectors along the rotation axes determined by R_c1, R_c2, R_d1, R_d2. Since R_x maps k_d1 to k_c1 and k_d2 to k_c2, it must also map k_d1 × k_d2 to k_c1 × k_c2. Putting these relations into one matrix equation gives

(k_c1  k_c2  k_c1 × k_c2) = R_x (k_d1  k_d2  k_d1 × k_d2).

When k_c1 and k_c2 are not parallel to each other, the matrix above is of full rank, and then

R_x = (k_c1  k_c2  k_c1 × k_c2)(k_d1  k_d2  k_d1 × k_d2)^{-1}.

Substituting the R_x solved above into the translation equations yields four linear equations in t_x; any three independent ones of them can be solved for t_x, or t_x is obtained from all four equations by least squares. It should be noted in particular that as long as the rotation axes of the two motions R_d1 and R_d2 are not parallel to each other, the solved R_x is unique, and when R_c1 and R_c2 are not identity matrices (i.e., the motions are not pure translations), t_x is also uniquely determined.
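A numpy sketch of this two-motion solution for R_x and t_x (illustrative; the rotation-axis extraction and least-squares step follow the equations above, the axis signs are assumed to be consistent between the camera and IMU motions, and a production implementation would disambiguate them).

import numpy as np

def rotation_axis(Rm):
    """Unit rotation axis k of a rotation matrix (eigenvector for eigenvalue 1)."""
    w, v = np.linalg.eig(Rm)
    k = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return k / np.linalg.norm(k)

def hand_eye_two_motions(Rc1, tc1, Rc2, tc2, Rd1, td1, Rd2, td2):
    """Solve Rc_i Rx = Rx Rd_i and Rc_i tx + tc_i = Rx td_i + tx for two motions."""
    kc1, kc2 = rotation_axis(Rc1), rotation_axis(Rc2)
    kd1, kd2 = rotation_axis(Rd1), rotation_axis(Rd2)
    Mc = np.column_stack([kc1, kc2, np.cross(kc1, kc2)])
    Md = np.column_stack([kd1, kd2, np.cross(kd1, kd2)])
    Rx = Mc @ np.linalg.inv(Md)            # full rank when kc1 and kc2 are not parallel
    # stack (Rc_i - I) tx = Rx td_i - tc_i and solve by least squares
    A = np.vstack([Rc1 - np.eye(3), Rc2 - np.eye(3)])
    b = np.hstack([Rx @ td1 - tc1, Rx @ td2 - tc2])
    tx, *_ = np.linalg.lstsq(A, b, rcond=None)
    return Rx, tx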
In principle, data from three poses of the vision sensor and the inertial measurement device suffice to complete the calibration between them. To reduce the calibration error, data from multiple measurements are used and expressed in quaternion form. The dot product of unit quaternions is a similarity measure between two quaternions q1 and q2, and the optimal rotation is therefore the one that maximizes the sum of these similarity measures over all measurements. To obtain a rough estimate of this optimum, the problem is recombined in the following linear form. The n measured quaternions are collected into the 4xn matrices

Q_c = (q_c1, ..., q_cn)
Q_d = (q_d1, ..., q_dn),

and, letting S denote the 4x4 matrix representation of multiplication by the unknown quaternion q_x, the measurement relation can be expressed as:

Q_d ≡ S Q_c.

Solving this relation yields the rotation matrix R_x, and substituting R_x back into the translation equations then yields t_x.
In the joint calibration of the vision sensor and the inertial measurement device, multi-dimensional parameter information must be obtained from the inertial measurement device. The procedure is as follows: first, the 3D information is obtained directly, i.e., a set of target attitude angles expressed as three Euler angles together with the raw angular velocity and linear acceleration are obtained directly from the inertial measurement device, giving the target attitude at the required time t_x; then, assuming the target initial position is zero, i.e., the initial position s_0 = 0, the target position s_1 after a time Δt at initial velocity v is estimated by the linear formula s_1 = v Δt + ½ a Δt², where a is the target acceleration. The multi-dimensional parameter information comprises the target attitude angles, the raw angular velocity, the linear acceleration, the target attitude time and the target position.
The fusion positioning of the head-mounted augmented reality system using the measurement data acquired by the vision sensor and the inertial measurement device comprises measurement preprocessing, visual and inertial data fusion, relocation, and global pose graph optimization.
The measurement preprocessing comprises visual front-end processing and IMU data pre-integration.
In the visual front-end processing, the feature points of each frame are tracked with the LK sparse optical flow algorithm while the corner feature points of each frame are detected, keeping 300 features per frame; by enforcing a minimum pixel distance between two neighbouring features, the features are uniformly distributed over each frame.
Corner feature points are tracked as local features because they are highly distinctive within a local neighbourhood. Corner detection uses the second-derivative matrix of the image grey-level intensity, i.e. the two-dimensional Hessian matrix

H(p) = [ ∂²I/∂x²  ∂²I/∂x∂y ; ∂²I/∂x∂y  ∂²I/∂y² ],

where p denotes a pixel in the image and I the grey-level intensity at pixel p. The second derivatives of the grey-level intensity are computed for all pixels of a frame to obtain the second-derivative images, and the autocorrelation matrix of these derivative images is computed over a fixed window around each pixel:

M(x, y) = Σ_{-K ≤ i,j ≤ K} w_{i,j} [ I_x²(x+i, y+j)  I_x I_y(x+i, y+j) ; I_x I_y(x+i, y+j)  I_y²(x+i, y+j) ],

where w_{i,j} is the normalized weight of the pixel in row i and column j of the fixed window (producing a circular window with Gaussian weighting), K is the number of pixels per row or column of the fixed window, and I_x, I_y are the image derivatives used to compute the correlation at the pixels of the fixed window around (x, y). The eigenvalues of the autocorrelation matrix are computed at each pixel, and if the smaller eigenvalue still exceeds the set minimum threshold, the pixel is taken as a corner feature point.
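In practice, the minimum-eigenvalue corner criterion with an enforced feature spacing is available directly in OpenCV; the sketch below is illustrative, with the feature count (300) and the spacing value taken as assumptions consistent with the front-end description above.

import cv2

def detect_corners(gray, max_corners=300, min_distance=30):
    """Minimum-eigenvalue (Shi-Tomasi style) corner detection with enforced spacing,
    so that features are uniformly distributed over the image."""
    corners = cv2.goodFeaturesToTrack(
        gray,
        maxCorners=max_corners,        # keep about 300 features per frame
        qualityLevel=0.01,             # relative threshold on the smallest eigenvalue
        minDistance=min_distance,      # minimum pixel distance between features
        useHarrisDetector=False)       # use the minimum-eigenvalue criterion
    return corners.reshape(-1, 2) if corners is not None else None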
The invention tracks feature points with the LK optical flow method. Optical flow describes the motion of pixels in the image; in LK optical flow the image delivered by the camera varies over time and can therefore be regarded as a function of time, I(t). That is, at time t the pixel at (x, y) has grey level I(x, y, t); the image is treated as a function of position and time whose value is the grey level of the pixel. For a fixed point with pixel coordinates (x, y) at time t, the camera motion changes the image, and to estimate the motion of the pixel the invention assumes that the grey value of the same spatial point is constant across images, that the pixels within a certain window share the same motion, and that the camera motion is slow.
For the pixel located at (x, y) at time t, it moves to (x + dx, y + dy) at time t + dt while its grey level remains unchanged, i.e.

I(x + dx, y + dy, t + dt) = I(x, y, t).

A Taylor expansion of the left-hand side at I(x, y, t) gives

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + I'_x dx + I'_y dy + I'_t dt + n,

where n is a higher-order infinitesimal; neglecting it yields

I'_x dx + I'_y dy + I'_t dt = 0.

Dividing both sides by dt gives:

I'_x (dx/dt) + I'_y (dy/dt) + I'_t = 0,

where dx/dt is the motion speed of the pixel along the x axis and dy/dt the motion speed along the y axis, denoted u and v respectively; I'_x is the gradient of the image in the x direction at that point, I'_y the gradient in the y direction, and I'_t the temporal change of the image grey level. In matrix form:

[I'_x  I'_y] [u; v] = -I'_t.

Consider a window of size w × w containing w² pixels. Since the pixels within the window share the same motion, w² such equations are obtained:

[I'_x  I'_y]_k [u; v] = -I'_{t,k},  k = 1, ..., w²,

which are written as the matrix equation

A [u; v] = -b,  with  A = [ [I'_x, I'_y]_1 ; ... ; [I'_x, I'_y]_{w²} ],  b = [ I'_{t,1} ; ... ; I'_{t,w²} ].

Solving this over-determined system in the least-squares sense gives the estimate of the pixel motion:

[u; v]* = -(AᵀA)⁻¹ Aᵀ b.
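The windowed least-squares solution derived above is what pyramidal LK implementations compute internally; a minimal OpenCV sketch follows (illustrative window size and pyramid depth).

import cv2
import numpy as np

def track_features_lk(prev_gray, cur_gray, prev_pts):
    """Track feature points from the previous frame with pyramidal LK optical flow."""
    p0 = prev_pts.astype(np.float32).reshape(-1, 1, 2)
    p1, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_gray, p0, None,
        winSize=(21, 21), maxLevel=3)        # w x w window and pyramid levels
    good = status.ravel() == 1               # keep only successfully tracked points
    return p0.reshape(-1, 2)[good], p1.reshape(-1, 2)[good]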
After tracking the feature points of each frame with the LK sparse optical flow algorithm, key frames are selected by an average-parallax criterion or a tracking-quality criterion. In the average-parallax criterion, if the average parallax of the tracked features between the current frame and the latest key frame exceeds the set threshold (10), the current frame is taken as a new key frame; when computing the average parallax, the image rotation is compensated with the short-term integral of the inertial measurement results. Both translation and rotation produce parallax, but features under purely rotational motion cannot be triangulated and cannot be used to estimate relative depth, so the short-term integral of the gyroscope measurements is used to compensate the rotation when computing the parallax. Because this rotation compensation is used only for key-frame selection, even a gyroscope with large noise or bias can at worst cause an unreasonable key-frame selection and does not directly degrade the quality of the pose estimation. In the tracking-quality criterion, if the number of tracked features falls below a set value, the current frame is taken as a new key frame to avoid losing the feature tracks entirely; the set value may be 100.
The IMU data pre-integration uses the IMU integration model and an IMU pre-integration algorithm to compute the observed value of the inertial measurement data, the covariance matrix of the residual and the Jacobian matrix of the residual: the observed value is used to compute the residual, the Jacobian matrix of the residual gives the gradient direction for the optimization-based data fusion, and the covariance matrix provides the weight of the observed value used when fusing the inertial and visual data.
The IMU output rate is 200 Hz; such fast and voluminous data would rapidly inflate the number of optimization variables and destroy the real-time performance of the optimization. The pre-integration method turns the IMU samples between two image frames into a constant term, greatly reducing the number of optimization variables, while the covariance propagation equation is derived from the continuous-time IMU error-state dynamics.
In the IMU integration model, the raw gyroscope and accelerometer measurements of the IMU, ω̂_t and â_t, are expressed as follows:

â_t = a_t + b_{a_t} + R^t_w g^w + n_a
ω̂_t = ω_t + b_{ω_t} + n_ω.

The directly obtained inertial measurements are expressed in the body coordinate system, so they contain the gravity information as well as the actual inertial motion of the head-mounted augmented reality system, and they are affected by the accelerometer bias b_a, the gyroscope bias b_ω and additive noise. The accelerometer and gyroscope measurement noise is assumed to be Gaussian, i.e.:

n_a ~ N(0, σ_a²),  n_ω ~ N(0, σ_ω²).

The accelerometer and gyroscope bias terms are both modeled as random walks, i.e.:

ḃ_{a_t} = n_{b_a},  ḃ_{ω_t} = n_{b_ω},

whose derivatives follow Gaussian distributions:

n_{b_a} ~ N(0, σ_{b_a}²),  n_{b_ω} ~ N(0, σ_{b_ω}²).

For the image frames b_k and b_{k+1}, corresponding to the two time instants t_k and t_{k+1}, the position, velocity and orientation of the head-mounted augmented reality system in the world coordinate system over the time interval [t_k, t_{k+1}] are propagated from the inertial measurements as:

p^w_{b_{k+1}} = p^w_{b_k} + v^w_{b_k} Δt_k + ∫∫_{t∈[t_k,t_{k+1}]} ( R^w_t (â_t - b_{a_t} - n_a) - g^w ) dt²
v^w_{b_{k+1}} = v^w_{b_k} + ∫_{t∈[t_k,t_{k+1}]} ( R^w_t (â_t - b_{a_t} - n_a) - g^w ) dt
q^w_{b_{k+1}} = q^w_{b_k} ⊗ ∫_{t∈[t_k,t_{k+1}]} ½ Ω(ω̂_t - b_{ω_t} - n_ω) q^{b_k}_t dt,

where Ω(ω) is the quaternion kinematic matrix of the angular velocity ω, and Δt_k is the time interval [t_k, t_{k+1}].
It can be seen that the IMU state propagation requires the rotation, position and velocity of frame b_k. In the global pose graph optimization algorithm adopted by the invention, the state propagation equation between two adjacent frames would therefore have to be recomputed each time the head-mounted augmented reality system adjusts its pose, which requires significant computational resources. To avoid such re-propagation, the invention employs a pre-integration algorithm.
The IMU pre-integration algorithm changes the reference coordinate system from the world frame to the body frame b_k and pre-integrates only the terms related to the linear acceleration â and the angular velocity ω̂, as follows:

R^{b_k}_w p^w_{b_{k+1}} = R^{b_k}_w ( p^w_{b_k} + v^w_{b_k} Δt_k - ½ g^w Δt_k² ) + α^{b_k}_{b_{k+1}}
R^{b_k}_w v^w_{b_{k+1}} = R^{b_k}_w ( v^w_{b_k} - g^w Δt_k ) + β^{b_k}_{b_{k+1}}
q^{b_k}_w ⊗ q^w_{b_{k+1}} = γ^{b_k}_{b_{k+1}},

where:

α^{b_k}_{b_{k+1}} = ∫∫_{t∈[t_k,t_{k+1}]} R^{b_k}_t (â_t - b_{a_t} - n_a) dt²
β^{b_k}_{b_{k+1}} = ∫_{t∈[t_k,t_{k+1}]} R^{b_k}_t (â_t - b_{a_t} - n_a) dt
γ^{b_k}_{b_{k+1}} = ∫_{t∈[t_k,t_{k+1}]} ½ Ω(ω̂_t - b_{ω_t} - n_ω) γ^{b_k}_t dt.
Obviously, with b_k as the reference frame, the pre-integration terms can be obtained from the IMU measurements alone; α̂^{b_k}_{b_{k+1}}, β̂^{b_k}_{b_{k+1}} and γ̂^{b_k}_{b_{k+1}} depend only on the IMU biases at b_k and b_{k+1}. When the bias estimate changes, if the change is small the pre-integration terms are adjusted by their first-order approximation with respect to the bias; otherwise they are recomputed. Because the pre-integration does not need to be repeatedly recomputed from the raw IMU measurements, this saves a considerable amount of computational resources for optimization-based algorithms.
For the discrete numerical integration on the computer, the invention selects the Euler integration method. In the initial state, α̂^{b_k}_{b_k} and β̂^{b_k}_{b_k} are zero and γ̂^{b_k}_{b_k} is the identity quaternion. The mean values of α, β and γ in the formulas are propagated step by step; since the additive noise terms n_a, n_ω obey Gaussian distributions, they are treated as zero in the invention. In summary, the estimated values of the pre-integration terms are:

α̂^{b_k}_{i+1} = α̂^{b_k}_i + β̂^{b_k}_i δt + ½ R(γ̂^{b_k}_i)(â_i - b_{a_i}) δt²
β̂^{b_k}_{i+1} = β̂^{b_k}_i + R(γ̂^{b_k}_i)(â_i - b_{a_i}) δt
γ̂^{b_k}_{i+1} = γ̂^{b_k}_i ⊗ [1, ½ (ω̂_i - b_{ω_i}) δt]^T,

where i corresponds to a discrete IMU measurement instant within [t_k, t_{k+1}] and δt is the time interval between two IMU measurements i and i + 1.
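A minimal Euler pre-integration sketch following the update equations above (illustrative only; scipy Rotations stand in for the quaternions γ̂, and the noise terms are treated as zero as in the mean propagation).

import numpy as np
from scipy.spatial.transform import Rotation as R

def preintegrate(acc, gyro, dt, ba, bw):
    """Euler pre-integration of IMU samples between frames b_k and b_{k+1}.
    acc, gyro: lists of accelerometer / gyroscope samples in the body frame
    dt:        list of time intervals between consecutive samples
    ba, bw:    accelerometer and gyroscope bias estimates at b_k."""
    alpha = np.zeros(3)                     # pre-integrated position term
    beta = np.zeros(3)                      # pre-integrated velocity term
    gamma = R.identity()                    # pre-integrated rotation (quaternion)
    for a_hat, w_hat, d in zip(acc, gyro, dt):
        a = gamma.apply(a_hat - ba)         # de-biased acceleration rotated into frame b_k
        alpha += beta * d + 0.5 * a * d * d
        beta += a * d
        gamma = gamma * R.from_rotvec((w_hat - bw) * d)   # small rotation update
    return alpha, beta, gamma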
For the computation of the IMU Jacobian and covariance matrices, since the four-dimensional rotation quaternion γ̂^{b_k}_t is over-parameterized, its error term is defined as a perturbation around its mean:

γ^{b_k}_t ≈ γ̂^{b_k}_t ⊗ [1, ½ δθ^{b_k}_t]^T,

where δθ^{b_k}_t is a three-dimensional small perturbation. The continuous-time linearized error dynamics are therefore

δż^{b_k}_t = F_t δz^{b_k}_t + G_t n_t,

where δz^{b_k}_t = [δα^{b_k}_t, δβ^{b_k}_t, δθ^{b_k}_t, δb_{a_t}, δb_{ω_t}]^T is the error state, n_t = [n_a, n_ω, n_{b_a}, n_{b_ω}]^T is the noise vector, and F_t and G_t are the Jacobians of the error dynamics with respect to the error state and the noise, respectively. The covariance is computed recursively by first-order discrete-time propagation, with the initial covariance P^{b_k}_{b_k} = 0:

P^{b_k}_{t+δt} = (I + F_t δt) P^{b_k}_t (I + F_t δt)^T + (G_t δt) Q (G_t δt)^T,  t ∈ [k, k+1],

where Q is the diagonal covariance matrix of the noise,

Q = diag(σ_a², σ_ω², σ_{b_a}², σ_{b_ω}²).

At the same time, the first-order Jacobian matrix of δz^{b_k}_{b_{k+1}} with respect to δz^{b_k}_{b_k} can also be computed recursively, starting from the initial Jacobian J_{b_k} = I:

J_{t+δt} = (I + F_t δt) J_t,  t ∈ [k, k+1].

Using these recursions, the covariance matrix P^{b_k}_{b_{k+1}} and the Jacobian matrix J_{b_{k+1}} are obtained. The first-order approximation of the pre-integration terms with respect to the biases can then be written as:

α̂^{b_k}_{b_{k+1}} ≈ α̂^{b_k}_{b_{k+1}} + J^α_{b_a} δb_{a_k} + J^α_{b_ω} δb_{ω_k}
β̂^{b_k}_{b_{k+1}} ≈ β̂^{b_k}_{b_{k+1}} + J^β_{b_a} δb_{a_k} + J^β_{b_ω} δb_{ω_k}
γ̂^{b_k}_{b_{k+1}} ≈ γ̂^{b_k}_{b_{k+1}} ⊗ [1, ½ J^γ_{b_ω} δb_{ω_k}]^T,

where J^α_{b_a} is the sub-block matrix of J_{b_{k+1}} whose position corresponds to δα/δb_a, and the other sub-blocks have the same meaning. When the bias estimate changes slightly, the pre-integration result can be corrected approximately by these formulas without recomputing the integration, which reduces the amount of computation. Using the covariance P^{b_k}_{b_{k+1}}, the IMU measurement model relates the pre-integrated terms [α̂^{b_k}_{b_{k+1}}, β̂^{b_k}_{b_{k+1}}, γ̂^{b_k}_{b_{k+1}}] to the states of frames b_k and b_{k+1} with measurement noise of covariance P^{b_k}_{b_{k+1}}.
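The discrete covariance and Jacobian recursions can be sketched directly from the formulas above (illustrative; F_t and G_t are assumed to be supplied by the linearized error dynamics, and the dimensions assume the 15-dimensional error state and a 12-dimensional noise vector).

import numpy as np

def propagate_cov_jacobian(F_list, G_list, dt_list, Q):
    """First-order discrete propagation of the pre-integration covariance and Jacobian.
    F_list, G_list: error-dynamics Jacobians F_t (15x15) and G_t (15x12) per IMU step
    Q:              diagonal noise covariance diag(sigma_a^2, sigma_w^2, sigma_ba^2, sigma_bw^2)."""
    P = np.zeros((15, 15))                  # initial covariance is zero
    J = np.eye(15)                          # initial Jacobian is identity
    for F, G, dt in zip(F_list, G_list, dt_list):
        A = np.eye(15) + F * dt
        P = A @ P @ A.T + (G * dt) @ Q @ (G * dt).T
        J = A @ J
    return P, J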
the visual and inertial data fusion process described, using two successive frame images b in a sliding windowkAnd bk+1According to the IMU measurement model, using the accelerometer and gyroscope measurements containing noise in the time interval between two consecutive frames of images, calculating the residual error of the pre-integrated inertial measurement data as:
Figure BDA0003476374720000372
wherein [ ·]xyzThe vector portion of the extracted quaternion q represents the error state,
Figure BDA0003476374720000373
three-dimensional error being quaternionThe status is represented by a status indication,
Figure BDA0003476374720000374
the method is an IMU pre-integration measurement, the part uses accelerometer and gyroscope measurement values containing noise in a time interval between two continuous frames of images, and the online correction residual error also contains the deviation of the accelerometer and the gyroscope.
For the residual of the visual measurement, unlike the pinhole camera model that defines the reprojection error on a generalized image plane, the invention defines the measurement residual on a unit sphere of the vision sensor. Considering the l-th feature, first observed in the i-th frame, its observation residual in the j-th frame is defined as

r_C(ẑ^{c_j}_l, X) = [b_1, b_2]^T · ( P̂^{c_j}_l / ||P̂^{c_j}_l|| - P^{c_j}_l / ||P^{c_j}_l|| ),

where P̂^{c_j}_l = π_c^{-1}([û^{c_j}_l, v̂^{c_j}_l]^T) is the observation of the feature in the j-th frame back-projected onto the unit sphere, and P^{c_j}_l is the prediction of the same feature in the j-th frame obtained by transforming its first observation in the i-th frame through its inverse depth and the poses of frames i and j. Here [û^{c_i}_l, v̂^{c_i}_l]^T is the first observation of the l-th feature in the i-th frame, [û^{c_j}_l, v̂^{c_j}_l]^T is the observation of the same feature in the j-th frame, and π_c^{-1} is the back-projection function that converts a pixel location into a unit vector using the intrinsic parameters of the vision sensor. Since the degree of freedom of the visual residual is 2, the residual vector is projected onto a tangent plane: b_1 and b_2 are two arbitrarily chosen orthogonal bases spanning the tangent plane of P̂^{c_j}_l / ||P̂^{c_j}_l||, as shown in FIG. 7, and the standard covariance has a fixed length in this tangent space. By transforming the feature first observed in the i-th frame into the j-th frame, the feature measurement on the unit sphere is predicted, and the residual vector lies on the tangent plane of P̂^{c_j}_l / ||P̂^{c_j}_l||.
The visual and inertial data fusion process adopts a marginalization method to reduce the computational complexity of the nonlinear-optimization-based VIO: selected inertial state quantities x_k and feature quantities λ_l are marginalized out of the sliding window, and the quantities associated with the marginalized states are converted into prior data, so that spatially separated key frames remain in the sliding window while providing sufficient parallax for feature triangulation. As shown in FIG. 8, when the next new frame is a key frame, it is retained in the sliding window and the measurements corresponding to the earliest frame are marginalized. If the next new frame is not a key frame, only its inertial measurement data are kept and its visual measurement data are discarded; the inertial data remain associated with the non-key frame, and, to preserve the sparsity of the system, the measurements of non-key frames are not marginalized. The marginalization is computed with the Schur complement, and a new prior is constructed from all marginalized measurement data related to the removed states and added to the existing prior.
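Marginalization via the Schur complement can be sketched as follows (illustrative; H and b are the Hessian and gradient of the linearized problem, with the states to be marginalized ordered first).

import numpy as np

def marginalize(H, b, m):
    """Schur complement of the first m states (the marginalized block).
    Returns the prior Hessian and gradient acting on the remaining states."""
    Hmm = H[:m, :m]
    Hmr = H[:m, m:]
    Hrm = H[m:, :m]
    Hrr = H[m:, m:]
    bm, br = b[:m], b[m:]
    Hmm_inv = np.linalg.inv(Hmm + 1e-8 * np.eye(m))   # small regularization for numerical safety
    H_prior = Hrr - Hrm @ Hmm_inv @ Hmr
    b_prior = br - Hrm @ Hmm_inv @ bm
    return H_prior, b_prior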
After the estimator has been initialized, the state estimation of the head-mounted augmented reality system is performed with the sliding-window-based tightly coupled monocular VIO, as shown in FIG. 9.
The complete state vector in the sliding window is:

X = [x_0, x_1, ..., x_n, λ_0, λ_1, ..., λ_m]
x_k = [p^w_{b_k}, v^w_{b_k}, q^w_{b_k}, b_a, b_ω],  k ∈ [0, n],

where x_k is the IMU state at the time the k-th frame is captured, comprising the position, velocity and orientation of the IMU in the world coordinate system and the accelerometer and gyroscope biases of the IMU in the body coordinate system; n is the total number of key frames and m the total number of features in the sliding window; λ_l is the inverse depth of the l-th feature from its first observation.
The invention performs the fusion optimization with a visual-inertial bundle adjustment. The maximum a posteriori estimate is obtained by minimizing the sum of the prior term and the Mahalanobis norms of all measurement residuals:

min_X { || r_p - H_p X ||² + Σ_{k∈B} || r_B(ẑ^{b_k}_{b_{k+1}}, X) ||²_{P^{b_k}_{b_{k+1}}} + Σ_{(l,j)∈C} ρ( || r_C(ẑ^{c_j}_l, X) ||²_{P^{c_j}_l} ) },

where the Huber norm is:

ρ(s) = s,  s ≤ 1;   ρ(s) = 2√s - 1,  s > 1,

and r_B(ẑ^{b_k}_{b_{k+1}}, X) and r_C(ẑ^{c_j}_l, X) are the IMU and visual measurement residuals respectively, B is the set of all IMU measurements, C is the set of features observed at least twice in the current sliding window, and {r_p, H_p} is the prior information from marginalization.
The sliding window and marginalization scheme effectively limits the computational complexity, but drift accumulates in the global 3D position (x, y, z) estimate and in the rotation about the gravity direction (yaw). To eliminate this drift, the invention adopts a tightly coupled relocation method that is seamlessly integrated with the monocular VIO.
In the relocation process, loop detection is realized with a bag-of-words model. The bag-of-words model is first built from feature point data actually collected in the space operating environment; when the position of the vision sensor changes, a new feature point set is computed in real time and, after being formed into a new description vector, is matched against the key-frame feature description vectors; if the matching succeeds, a loop is detected. The visited positions are recognized with the bag-of-words model, a feature-level connection is then established between the loop candidate and the current frame, and the feature pairs are integrated into the nonlinear-optimization-based visual-inertial odometry (VIO) to obtain a drift-free estimate of the target state. Because the relocation uses multiple observations of multiple features, the positioning accuracy and the smoothness of the state estimate are effectively improved. The description vector only records whether a certain feature appears and does not consider where it appears; by matching the feature point set obtained at a position against the bag of words, the image at that position is described as

A = a_1 w_1 + a_2 w_2 + a_3 w_3 + ... + a_n w_n,

where w_i denotes the i-th feature (word) in the bag of words, a_i the frequency with which that feature appears, and [a_1, a_2, ..., a_n] is the description vector. The L1 norm is used to compare description vectors, i.e., the distance between two description vectors a and b is computed from the L1 norm of their difference over all W dimensions, where W is the dimension of the description vectors.
The bag-of-words model is constructed by clustering, and a k-ary tree is used to represent the bag of words so that the feature matching process does not consume too much time. From the feature point data collected in preliminary tests, a number of samples are selected at random for clustering, and the data are represented by a 10-ary tree of depth 5, which can accommodate about 100,000 features. Since an image obtained by the vision camera generally contains no more than 500 features, this bag of words can express the feature distribution of an image relatively completely; the 5-layer 10-ary tree describing the bag of words is shown in FIG. 10. The clustering is realized with the K-means algorithm: the number of clusters K is fixed, and K D-dimensional feature point data vectors are selected at random as the initial cluster centres, where D is the dimension of the feature point data; the distance (L1 norm) between each feature point data vector and each cluster centre is computed, and the nearest cluster centre is chosen as the class to which the vector belongs; within each class the cluster centre is recomputed, and the computation and classification steps are iterated until the sample data converge. The first four layers are processed in this way, after which the clustered bag-of-words model is obtained. Loop detection is implemented with the TF-IDF algorithm. Intuitively, loop detection only needs to judge whether the image obtained at each position is consistent with a previously obtained image, i.e., whether the description vectors are close enough. The invention considers that different features contribute differently to discrimination: features that appear particularly frequently do not really represent the key information of an image, so the TF-IDF (Term Frequency-Inverse Document Frequency) method is introduced. TF means that the more frequently a feature appears in a single image, the better that feature discriminates the image; IDF means that the less frequently a feature appears in the bag of words, the more easily an image containing that feature is classified correctly. The IDF and TF parts are taken into account when building the bag of words:

IDF_i = log( n / n_i ),
TF_i = m_i / m,

where n is the number of all features and n_i is the number of occurrences of feature w_i (i.e., a certain leaf node) in the samples used to build the bag of words, while m_i is the number of occurrences of w_i in the current image and m the total number of features in that image. The weight η_i of w_i is

η_i = IDF_i × TF_i.

Taking the weights into account, a new image A can be expressed as

A = η_1 w_1 + η_2 w_2 + ... + η_n w_n.
The image A is compared with previously processed frames; when the difference between the description vectors is less than 5%, the invention considers that the position has been reached before, and the loop detection is complete.
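A small sketch of the TF-IDF-weighted description vector and the loop-candidate comparison it feeds (illustrative; the 5% threshold follows the text, while the specific L1-based similarity is one common choice and an assumption here).

import numpy as np

def describe_image(word_ids, idf, num_words):
    """Build a TF-IDF weighted description vector from the quantized features of one image.
    word_ids: integer word index of each feature in the image
    idf:      per-word IDF weights learned when building the bag of words."""
    counts = np.bincount(word_ids, minlength=num_words)
    tf = counts / max(len(word_ids), 1)     # term frequency within this image
    return tf * idf                         # eta_i = IDF_i * TF_i

def is_loop_candidate(vec_a, vec_b, threshold=0.05):
    """Compare two description vectors with an L1-based distance."""
    na = vec_a / max(np.sum(np.abs(vec_a)), 1e-12)
    nb = vec_b / max(np.sum(np.abs(vec_b)), 1e-12)
    dist = 0.5 * np.sum(np.abs(na - nb))    # 0 = identical, 1 = disjoint
    return dist < threshold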
In practice, two adjacent frames differ so little that they are easily judged as a loop, which would make loop detection meaningless. The invention therefore introduces a clustering step: images within a short period are compared, frames that are too close are discarded, and only key frames are retained, greatly reducing the loop-detection overhead of the system. In addition to the corner features used by the monocular VIO, the system detects an extra 500 corner points described by BRIEF descriptors, and these additional corner features improve the recall of loop detection. The invention keeps all BRIEF descriptors for feature retrieval and discards the original images, reducing memory consumption and improving system performance.
When a loop is detected, the BRIEF descriptors are matched, feature consistency is checked, and the connection between the local sliding window and the loop candidate is established. However, direct descriptor matching may produce a large number of outliers, so the outliers are removed geometrically: a fundamental-matrix test is performed with the RANSAC algorithm using the 2D features retrieved in the current image and the candidate loop, and a PnP pose test is performed with the RANSAC algorithm using the 3D positions of features known in the local sliding window and their 2D projections in the candidate loop image. When the number of inliers exceeds 50, the candidate is treated as a correct loop detection and relocation is performed.
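The two geometric checks described above map directly onto OpenCV calls; a hedged sketch follows (the inlier threshold of 50 follows the text, while the RANSAC parameters are assumptions).

import cv2
import numpy as np

def verify_loop(pts_cur_2d, pts_loop_2d, pts_cur_3d, K):
    """Fundamental-matrix test and PnP test with RANSAC for a loop candidate.
    pts_cur_2d / pts_loop_2d: matched 2D features in the current and candidate images.
    pts_cur_3d:               3D positions of the current-window features (same order)."""
    # 1) epipolar consistency between the two images
    F, mask_f = cv2.findFundamentalMat(pts_cur_2d.astype(np.float32),
                                       pts_loop_2d.astype(np.float32),
                                       cv2.FM_RANSAC, 3.0, 0.99)
    if F is None:
        return False
    keep = mask_f.ravel() == 1
    # 2) PnP between the known 3D features and their projections in the loop image
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_cur_3d[keep].astype(np.float32),
        pts_loop_2d[keep].astype(np.float32), K, None,
        reprojectionError=8.0, iterationsCount=100)
    return bool(ok) and inliers is not None and len(inliers) > 50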
For the tightly coupled relocation, the relocation process efficiently aligns the current sliding window with the historical pose graph maintained by the monocular VIO. During relocation, the poses of all loop frames are treated as constants, and the sliding window is optimized jointly using the IMU measurements, the local visual measurements and the feature correspondences detected in the loop. FIG. 11(a) shows the relocation process: it starts from VIO-only pose estimation (blue) and records the historical states (green); if a loop is detected for the latest frame (red line in the figure), relocation takes place. Because the relocation uses feature-level associations, the system can easily incorporate loop constraints from multiple key frames. FIG. 11(b) shows the pose graph optimization process: when the last key frame in the sliding window is marginalized, it is added to the pose graph, and if a loop exists between this key frame and other key frames, a loop constraint expressed as a 4-degree-of-freedom relative rigid-body transformation is added to the pose graph. Features observed in the loop frame v have the same visual measurement model as in the VIO visual measurement, but the loop frame pose q̂^w_v, p̂^w_v obtained from the pose graph or directly from the previous odometry output is treated as a constant, and the nonlinear loss function is therefore changed to:

min_X { || r_p - H_p X ||² + Σ_{k∈B} || r_B(ẑ^{b_k}_{b_{k+1}}, X) ||²_{P^{b_k}_{b_{k+1}}} + Σ_{(l,j)∈C} ρ( || r_C(ẑ^{c_j}_l, X) ||²_{P^{c_j}_l} ) + Σ_{(l,v)∈L} ρ( || r_C(ẑ^{c_v}_l, X, q̂^w_v, p̂^w_v) ||²_{P^{c_v}_l} ) }.
For the global pose graph optimization, after relocation the local sliding window is moved and aligned with the past poses. Using the relocation result, the registration of the set of past poses into a globally consistent configuration is ensured by global pose graph optimization. Because the visual-inertial setup renders roll and pitch fully observable, the accumulated drift occurs in only four degrees of freedom (x, y, z and yaw). For this reason only a 4-DoF pose graph optimization is performed, and the estimation of the drift-free roll and pitch states is left untouched. When a key frame is marginalized out of the sliding window, it is added to the pose graph as a new vertex, and this new vertex is connected with the other vertices of the pose graph by sequential edges and/or loop edges. A newly marginalized key frame establishes several sequential edges linking it to existing key frames; a sequential edge represents the relative transformation between two key frames in the local sliding window, and its value is taken directly from the VIO. For a newly marginalized key frame i and one of its previous key frames j, the sequential edge contains only the relative position p̂^i_{ij} and the relative yaw angle ψ̂_{ij}:

p̂^i_{ij} = (R^w_i)^{-1} ( p^w_j - p^w_i )
ψ̂_{ij} = ψ̂_j - ψ̂_i.
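The 4-DoF sequential-edge measurement can be computed from the VIO output with a short helper (illustrative; yaw is extracted with a z-y-x Euler convention, which is an assumption made here).

import numpy as np
from scipy.spatial.transform import Rotation as R

def sequential_edge(p_i, q_i, p_j, q_j):
    """Relative position (expressed in frame i) and relative yaw between
    key frames i and j, taken directly from the VIO estimates.
    q_i, q_j: quaternions as [x, y, z, w]."""
    Ri = R.from_quat(q_i)
    p_ij = Ri.inv().apply(p_j - p_i)            # relative position in frame i
    yaw_i = Ri.as_euler('zyx')[0]
    yaw_j = R.from_quat(q_j).as_euler('zyx')[0]
    psi_ij = yaw_j - yaw_i                      # relative yaw
    return p_ij, psi_ij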
If a newly marginalized key frame has a loop connection with an existing key frame, a loop edge in the pose graph connects it to the loop frame; the loop edge contains only a 4-degree-of-freedom relative pose transformation, whose value is obtained from the relocation result.
For the 4-DoF pose graph optimization, the residual of the edge between frames i and j is defined as:

r_{i,j}(p^w_i, ψ_i, p^w_j, ψ_j) = [ R(φ̂_i, θ̂_i, ψ_i)^{-1} ( p^w_j - p^w_i ) - p̂^i_{ij} ;  ψ_j - ψ_i - ψ̂_{ij} ],

where φ̂_i and θ̂_i are the roll and pitch estimates obtained directly from the monocular VIO. The entire graph of sequential and loop edges is optimized by minimizing the following cost function:

min_{p, ψ} { Σ_{(i,j)∈S} || r_{i,j} ||² + Σ_{(i,j)∈L} ρ( || r_{i,j} ||² ) },

where S is the set of all sequential edges and L is the set of all loop edges. Although the tightly coupled relocation already helps to eliminate wrong loops, the invention additionally introduces the Huber norm ρ(·) to further reduce the influence of any wrong loop. The pose graph optimization and the relocation run independently, so relocation can be performed while the pose graph is being optimized, and when the pose graph optimization has not yet finished, relocation can use the existing pose graph.
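A residual function matching the 4-DoF edge definition above, suitable for a generic nonlinear least-squares solver, with the Huber weight applied only to loop edges (illustrative sketch; the fixed roll/pitch and the 'ZYX' Euler convention are assumptions).

import numpy as np
from scipy.spatial.transform import Rotation as R

def edge_residual(p_i, yaw_i, p_j, yaw_j, roll_i, pitch_i, p_ij_meas, psi_ij_meas):
    """4-DoF residual of a sequential or loop edge between key frames i and j.
    roll_i and pitch_i are drift-free VIO estimates and are held fixed."""
    R_i = R.from_euler('ZYX', [yaw_i, pitch_i, roll_i]).as_matrix()
    r_p = R_i.T @ (p_j - p_i) - p_ij_meas          # translational residual in frame i
    r_yaw = (yaw_j - yaw_i) - psi_ij_meas          # yaw residual
    return np.hstack([r_p, r_yaw])

def huber(s, delta=1.0):
    """Huber weighting applied to the squared norm of loop-edge residuals."""
    return s if s <= delta else 2.0 * np.sqrt(s * delta) - delta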
For the pose graph management, the pose graph may grow without bound as the travelled distance increases, affecting the long-term performance of the system. The invention therefore applies a down-sampling process to maintain the size of the pose graph database: all key frames with loop constraints are retained, while other key frames that are too similar may be deleted, and the more similar a key frame is, the higher the probability that it is deleted.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A visual and inertial hybrid pose tracking method of a head-mounted augmented reality system, characterized in that device initialization is first performed, comprising measurement preprocessing of a vision sensor and an inertial measurement device and joint calibration of the vision sensor and the inertial measurement device; the head-mounted augmented reality system is then fusion-positioned using the measurement data acquired by the vision sensor and the inertial measurement device, and loop detection and global pose graph optimization are performed on the positioning result to obtain the final pose estimation result.
2. The method for tracking mixed pose of vision and inertia of head-mounted augmented reality system according to claim 1, wherein the measurement preprocessing comprises processing frame images acquired by the vision sensor and performing pre-integration on the corresponding inertial measurement results of two consecutive frame images.
3. The method for tracking the mixed pose of vision and inertia of the head-mounted augmented reality system according to claim 1, wherein the fusion positioning process fuses the pre-integrated inertia measurement result and the visual image feature observation result to realize the fusion positioning of the head-mounted augmented reality system.
4. The method for tracking mixed pose of vision and inertia of head-mounted augmented reality system of claim 1, wherein the global pose graph is optimized to eliminate drift by global optimization using repositioning results.
5. The method for tracking mixed pose of vision and inertia of head-mounted augmented reality system of claim 1, wherein the device initialization process is performed by initializing visual SFM algorithm based on sliding window, calculating pose change of the head-mounted augmented reality system, and restoring scale, gravity, speed and deviation by matching IMU pre-integration result with visual SFM algorithm result for guiding VIO based on non-linear optimization.
6. The visual and inertial hybrid pose tracking method of the head-mounted augmented reality system according to claim 5, wherein the sliding-window-based visual SFM algorithm is initialized and comprises feature point tracking, camera pose estimation and optimization; for the feature point tracking, the feature correspondences between the latest frame acquired by the head-mounted augmented reality system and all frames in the sliding window are first checked, and if the latest frame shares with any frame in the sliding window more than a certain number of tracked features and rotation-compensated parallax pixels, the relative rotation and up-to-scale translation of the latest frame are recovered using the tracked features and rotation-compensated pixels and all features in the two frames are triangulated; otherwise the latest frame is stored in the sliding window and a new frame is awaited; for the camera pose estimation, the poses of all frames in the sliding window are estimated from the triangulated features with the PnP method to obtain the camera poses; for the optimization, the total reprojection error of all feature observations is minimized with a BA optimization method to complete the optimized estimation of the camera pose; P is the position of the real-space object point corresponding to the feature points of the left and right frames acquired by the left-eye and right-eye cameras and is called the feature position, p1 and p2 are the projections of the object point onto the left-eye and right-eye cameras, and O1 and O2 are the centres of the left-eye and right-eye cameras; writing P = [X, Y, Z]^T and using the pinhole camera model, the feature points p1 and p2 satisfy:

s1 p1 = K1 P

s2 p2 = K2 (R P + t),

wherein K1 and K2 are the intrinsic matrices of the left-eye and right-eye cameras, R and t are the rigid-body rotation matrix and offset matrix between the left-eye and right-eye cameras, and s1, s2 are the scale factors of the left-eye and right-eye cameras; the least-squares solution for P yields the feature position estimate; from the feature point matching information and the feature position estimation, the three-dimensional spatial positions of the current feature points and their projections onto the camera are obtained with the SFM algorithm, and the image pose is then estimated directly with the PnP algorithm.
7. The method for tracking mixed pose of vision and inertia of head-mounted augmented reality system according to claim 1, wherein the vision sensor and the inertial measurement unit are calibrated jointly, and the inertial measurement unit is calibrated to obtain the initial estimation result of the target pose, velocity, gravity vector, gyroscope bias and three-dimensional feature position, which is used for guiding the vision-inertial odometer VIO based on nonlinear optimization to calculate the absolute scale factor, gyroscope bias, acceleration bias, gravity acceleration and relative velocity of the inertial measurement unit to the vision sensor at each measurement time.
8. The visual and inertial hybrid pose tracking method of the head-mounted augmented reality system according to claim 6, wherein
the calibration of the inertial measurement device comprises gyroscope bias calibration, gravity vector refinement, and calibration of the velocity, gravity vector and scale; for the gyroscope bias calibration, two consecutive frames b_k and b_{k+1} in the image window are considered: the visual SFM algorithm provides their rotations q^{c0}_{b_k} and q^{c0}_{b_{k+1}}, and the pre-integrated result of the corresponding inertial measurements from frame b_{k+1} to frame b_k is recorded as γ̂^{b_k}_{b_{k+1}}; the pre-integration term is linearized with respect to the gyroscope bias of the inertial measurement device, and the following minimization loss function is constructed:

min_{δb_ω} Σ_{k∈B} || (q^{c0}_{b_{k+1}})^{-1} ⊗ q^{c0}_{b_k} ⊗ γ^{b_k}_{b_{k+1}} ||²,

γ^{b_k}_{b_{k+1}} ≈ γ̂^{b_k}_{b_{k+1}} ⊗ [1, ½ J^γ_{b_ω} δb_ω]^T,

wherein B is the set of all frame indices in the image window, and the bias Jacobian J^γ_{b_ω} is used as the calibration coefficient of the first-order approximation with respect to the gyroscope bias; solving the minimization loss function yields the initial calibration of the gyroscope bias b_ω;
after the gyroscope bias has been initialized, the velocity, the gravity vector and the actual scale are further computed, the quantities to be computed being expressed as:

X_I = [v^{b_0}_{b_0}, v^{b_1}_{b_1}, ..., v^{b_n}_{b_n}, g^{c0}, s]^T,

wherein v^{b_k}_{b_k} is the velocity of the target in the body coordinate system when the k-th frame is acquired, g^{c0} is the gravity vector measured in the frame of the c0-th image, and s is the scale factor that aligns the visual SFM algorithm with metric units.
9. The visual and inertial hybrid pose tracking method of a head-mounted augmented reality system according to claim 1, wherein the vision sensor and the inertial measurement unit are jointly calibrated; $C_{obj}$ denotes the object coordinate system, which is taken as the world coordinate system; $C_c$ denotes the vision sensor coordinate system; $C_i$ denotes the inertial measurement unit coordinate system; the relative pose of $C_c$ with respect to $C_{obj}$ is obtained by a vision sensor calibration method, represented here by $R_a$ and $t_a$, i.e. the rotation and the three translational components of the vision sensor relative to the world coordinate system; from the relative position $t_x$ between $C_i$ and $C_c$, the relative rotation $R_x$ between $C_i$ and $C_{obj}$ is obtained; the joint calibration of the vision sensor and the inertial measurement unit is carried out by observing a known calibration reference in space from different positions and deriving the relationship between $R_x$, $t_x$ and the multiple observations.

The joint calibration of the vision sensor and the inertial measurement unit requires multi-dimensional parameter information from the inertial measurement unit, and its implementation process is as follows: first, the 3D information is acquired directly, i.e. a set of target attitude angles represented by three Euler angles, together with the raw angular velocity and linear acceleration, is read from the inertial measurement unit, so as to obtain the required target attitude time $t_x$; then the target initial position is assumed to be zero, i.e. $s_0 = 0$, and the target position $s_1$ after a time $\Delta t$ with initial velocity $v$ is estimated by the linear formula $s_1 = v\,\Delta t + \tfrac{1}{2}\,a\,\Delta t^2$, where $a$ is the target acceleration; the multi-dimensional parameter information comprises the target attitude angles, the raw angular velocity, the linear acceleration, the target attitude time and the target position.
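The constant-acceleration position update quoted in this claim is simple enough to show directly; a minimal Python sketch under the claim's zero-initial-position assumption (the function name and the numbers in the usage example are illustrative, not from the patent):

import numpy as np

def propagate_position(v, a, dt, s0=None):
    """Claim 9's kinematic update: s1 = s0 + v*dt + 0.5*a*dt**2, with the
    initial position s0 defaulting to the zero vector."""
    s0 = np.zeros(3) if s0 is None else np.asarray(s0, dtype=float)
    return s0 + np.asarray(v, dtype=float) * dt + 0.5 * np.asarray(a, dtype=float) * dt ** 2

# Example: v = 0.2 m/s along x and a = 0.1 m/s^2 give a displacement of
# 0.2*0.5 + 0.5*0.1*0.5**2 = 0.1125 m after dt = 0.5 s.
s1 = propagate_position(v=[0.2, 0.0, 0.0], a=[0.1, 0.0, 0.0], dt=0.5)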
10. The visual and inertial hybrid pose tracking method of a head-mounted augmented reality system according to claim 1, wherein the fusion positioning of the head-mounted augmented reality system using the measurement data collected by the vision sensor and the inertial measurement unit comprises measurement preprocessing, visual-inertial data fusion, relocalization, and global pose graph optimization.
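A minimal Python outline of the four stages named in this claim; each method is a stub and all names are assumptions rather than the patent's implementation.

class FusionPipeline:
    """Outline of the fusion positioning flow: measurement preprocessing,
    visual-inertial fusion, relocalization, global pose-graph optimization."""

    def preprocess(self, images, imu_samples):
        # Measurement preprocessing: feature extraction/tracking and IMU pre-integration.
        return [], []

    def fuse(self, features, preintegrated):
        # Tightly coupled visual-inertial fusion producing a local pose estimate.
        return {"pose": None}

    def relocalize(self, state, features):
        # Relocalization: match current features against previously visited keyframes.
        return state

    def optimize_pose_graph(self, state):
        # Global pose-graph optimization keeps the whole trajectory consistent.
        return state

    def process(self, images, imu_samples):
        features, preintegrated = self.preprocess(images, imu_samples)
        state = self.fuse(features, preintegrated)
        state = self.relocalize(state, features)
        return self.optimize_pose_graph(state)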
CN202210055655.3A 2022-01-18 2022-01-18 Visual and inertial mixed pose tracking method of head-mounted augmented reality system Pending CN114485637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210055655.3A CN114485637A (en) 2022-01-18 2022-01-18 Visual and inertial mixed pose tracking method of head-mounted augmented reality system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210055655.3A CN114485637A (en) 2022-01-18 2022-01-18 Visual and inertial mixed pose tracking method of head-mounted augmented reality system

Publications (1)

Publication Number Publication Date
CN114485637A true CN114485637A (en) 2022-05-13

Family

ID=81512598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055655.3A Pending CN114485637A (en) 2022-01-18 2022-01-18 Visual and inertial mixed pose tracking method of head-mounted augmented reality system

Country Status (1)

Country Link
CN (1) CN114485637A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890781A (en) * 2012-07-04 2013-01-23 北京航空航天大学 Method for identifying wonderful shots as to badminton game video
CN104239411A (en) * 2014-08-12 2014-12-24 中国科学技术大学 Color and position clustering and angular point detection-based detection method for grid-shaped radar
CN108732570A (en) * 2017-04-20 2018-11-02 深圳市气象局 The nowcasting method of disastrous convection weather based on particle filter blending algorithm
CN109993113A (en) * 2019-03-29 2019-07-09 东北大学 A kind of position and orientation estimation method based on the fusion of RGB-D and IMU information

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MARTINEZ-DIAZ, SAUL: "3D Distance Measurement from a Camera to a Mobile Vehicle, Using Monocular Vision", Journal of Sensors, vol. 2021, pages 1-8 *
QIN, TONG ET AL.: "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator", IEEE Transactions on Robotics, vol. 34, no. 4, pages 1004-1020, XP011688854, DOI: 10.1109/TRO.2018.2853729 *
YAN, PENG: "Research on Robot Localization Technology Based on IMU and Vision Fusion", China Master's Theses Full-text Database (Information Science and Technology), pages 138-2535 *
HE, NING ET AL.: "Spatial Positioning of an Augmented Reality Space Flight Training System", Space Medicine & Medical Engineering, vol. 31, no. 2, pages 255-260 *
ZHU, ZHUANGSHENG; YUAN, XUEZHONG: "Calibration Method for the Accuracy of a Monocular-Vision-Based Position and Attitude Measurement System", Journal of Chinese Inertial Technology, no. 05, pages 100-105 *
GONG, ZHAOHUI; ZHANG, XIAOLI; PENG, XIAFU; LI, XIN: "Semi-Direct Monocular Visual Odometry Based on Visual-Inertial Fusion", Robot, no. 05, pages 85-95 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860079A (en) * 2022-05-17 2022-08-05 东南大学 Dynamic robust head-mounted augmented reality device registration method
CN115342829A (en) * 2022-07-20 2022-11-15 中国科学院自动化研究所 Mileage calculation method and device based on iterative extended Kalman filter
CN116205947A (en) * 2023-01-03 2023-06-02 哈尔滨工业大学 Binocular-inertial fusion pose estimation method based on camera motion state, electronic equipment and storage medium
CN116205947B (en) * 2023-01-03 2024-06-07 哈尔滨工业大学 Binocular-inertial fusion pose estimation method based on camera motion state, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109307508B (en) Panoramic inertial navigation SLAM method based on multiple key frames
CN109974693B (en) Unmanned aerial vehicle positioning method and device, computer equipment and storage medium
CN107341814B (en) Four-rotor unmanned aerial vehicle monocular vision range measurement method based on sparse direct method
CN106446815B (en) A kind of simultaneous localization and mapping method
KR102427921B1 (en) Apparatus and method for real-time mapping and localization
CN114485637A (en) Visual and inertial mixed pose tracking method of head-mounted augmented reality system
WO2018129715A1 (en) Simultaneous positioning and dense three-dimensional reconstruction method
US7212228B2 (en) Automatic camera calibration method
CN110726406A (en) Improved nonlinear optimization monocular inertial navigation SLAM method
CN108615246B (en) Method for improving robustness of visual odometer system and reducing calculation consumption of algorithm
CN112219087A (en) Pose prediction method, map construction method, movable platform and storage medium
CN110260861B (en) Pose determination method and device and odometer
CN112815939B (en) Pose estimation method of mobile robot and computer readable storage medium
JP2021518622A (en) Self-location estimation, mapping, and network training
CN111862316B (en) Three-dimensional reconstruction method of dense direct RGBD (Red Green blue-white) of tight coupling of IMU (inertial measurement Unit) based on optimization
CN111623773B (en) Target positioning method and device based on fisheye vision and inertial measurement
CN112115980A (en) Binocular vision odometer design method based on optical flow tracking and point line feature matching
Seiskari et al. HybVIO: Pushing the limits of real-time visual-inertial odometry
CN111998862A (en) Dense binocular SLAM method based on BNN
CN114529576A (en) RGBD and IMU hybrid tracking registration method based on sliding window optimization
CN116007609A (en) Positioning method and computing system for fusion of multispectral image and inertial navigation
CN116772844A (en) Navigation method of visual inertial indoor robot based on dynamic environment
Liu et al. Real-time dense construction with deep multi-view stereo using camera and IMU sensors
CN112991400B (en) Multi-sensor auxiliary positioning method for unmanned ship
Zhu et al. PairCon-SLAM: Distributed, online, and real-time RGBD-SLAM in large scenarios

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination