CN117870659A - Visual-inertial integrated navigation algorithm based on point-line features - Google Patents

Visual-inertial integrated navigation algorithm based on point-line features

Info

Publication number
CN117870659A
CN117870659A (application number CN202410025341.8A)
Authority
CN
China
Prior art keywords
image
point
pixel
points
imu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410025341.8A
Other languages
Chinese (zh)
Inventor
赵立业
陈锐锋
杨皓明
李宏生
黄丽斌
丁徐锴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202410025341.8A priority Critical patent/CN117870659A/en
Publication of CN117870659A publication Critical patent/CN117870659A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/10 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C21/12 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C21/16 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C21/165 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G01C21/1656 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments with passive imaging devices, e.g. cameras
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/10 Image enhancement or restoration using non-spatial domain filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20048 Transform domain processing
    • G06T2207/20061 Hough transform
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20164 Salient point detection; Corner detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30248 Vehicle exterior or interior
    • G06T2207/30252 Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual-inertial integrated navigation algorithm based on point-line features, which extracts and matches the point and line features of an image separately and fuses the visual and inertial data with a multiple-model estimation algorithm. The algorithm comprises the following steps: extracting feature points of the image with the ORB method; extracting feature line segments of the image with Canny edge detection and the Hough transform; evaluating the texture of the scene with the gray-level co-occurrence matrix and allocating the required numbers of point and line features according to the result, so that the system is sufficiently flexible, accuracy is ensured, and computational efficiency is improved; matching the extracted point-line features and recovering the three-dimensional information of the features with a disparity method; and fusing the visual information with the inertial odometry information using multiple-model estimation, where an interacting multiple-model algorithm runs several filters in parallel, each representing a different motion model or hypothesis, and then combines their estimates according to their performance, enabling the system to adapt to a variety of operating environments.

Description

Visual-inertial integrated navigation algorithm based on point-line features
Technical Field
The invention belongs to the field of navigation, and particularly relates to a visual-inertial integrated navigation algorithm based on point-line features.
Background
A visual-inertial integrated navigation algorithm involves three aspects: a visual navigation method, an inertial navigation method, and a data fusion method. For the first aspect there are many visual navigation methods, such as optical flow methods, which estimate motion by analyzing pixel variations between successive camera frames; feature-based methods, which identify and track feature points or feature regions in the scene, such as corners, edges or specific patterns, to estimate motion; and direct methods, which use the pixel intensity information of the image directly instead of extracting feature points. For data fusion, the currently mature schemes are filter methods, which use a Kalman filter and its variants to fuse visual and inertial data, and nonlinear optimization methods, which construct a cost function and then solve for the optimal state using nonlinear optimization techniques.
The existing technology related to visual-inertial integrated navigation has several disadvantages. Optical flow methods rely on analyzing pixel variations between successive camera frames to estimate motion, which depends heavily on the visual features of the environment. In environments with poor or repetitive features (such as single-colored walls or highly repetitive textures), optical flow algorithms may struggle to work accurately. In dynamic environments, especially when there are multiple moving objects in the scene, the performance of optical flow is greatly compromised because it is difficult to distinguish camera motion from the motion of objects in the scene. Optical flow algorithms are also very sensitive to illumination conditions, and strong illumination changes or shadows may lead to inaccurate optical flow estimates.
Feature-based methods depend on detecting and tracking specific feature points or feature regions in the image. Although they excel in many applications, in environments where features are lacking (such as plain walls) or overly abundant (such as complex patterns) it may be difficult to extract and track feature points correctly. Feature detection is also affected by illumination conditions and viewpoint changes; the original features may be hard to recognize under extreme illumination or when viewed from different angles, and in dynamic environments moving objects may introduce wrong feature points, causing the navigation system to make wrong judgments.
Direct methods do not rely on explicit feature extraction; they estimate motion directly from the pixel intensity of the image. They are superior to feature-based methods in some respects, but they are very sensitive to illumination changes, since illumination affects pixel intensity and therefore the accuracy of motion estimation. Direct methods also tend to be computationally expensive because a large amount of pixel data must be processed, which can be challenging for systems with limited computing resources. In dynamic environments, moving objects change pixel intensities, and direct methods may find it hard to distinguish camera motion from object motion. As with feature-based methods, direct methods cannot recover real-world scale from a single image and need additional information or assumptions to resolve the scale ambiguity. They also depend strongly on image quality and texture distribution, so image blur or texture-poor regions can cause estimation failure, and because direct methods usually work in local image windows they are less effective than feature-based methods for global optimization.
Regarding filter-based data fusion, filters such as the extended Kalman filter process data by linearizing a nonlinear model, which can introduce estimation errors; in particular, under strong system dynamics such as complex robot motion or extreme flight dynamics, traditional filtering methods may have difficulty estimating the state accurately, and filter performance depends largely on the parameter settings. Improper parameter selection may degrade performance, and these parameters typically require manual adjustment and calibration. Regarding nonlinear-optimization-based data fusion, although it offers certain advantages in accuracy and robustness, it generally requires a large amount of computing resources, which may cause high power consumption and latency, especially when processing high-resolution images and high-frequency inertial data; moreover, the convergence and stability of the nonlinear optimization problem depend on the initial guess and on the shape and constraints of the optimization problem, and in some cases the optimization may not converge, or may take a long time to converge to a reasonable solution.
Disclosure of Invention
In order to solve the above problems, the invention discloses a visual-inertial integrated navigation algorithm based on point-line features, which extracts, describes and matches the feature points and feature line segments of the image separately, and then fuses the visual information with the inertial odometry information based on a multiple-model estimation algorithm. This improves computational efficiency, adapts to complex motion or extreme flight dynamics, achieves accurate position and orientation tracking, offers more flexible feature selection, and has better adaptability to the environment.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
the visual inertial integrated navigation algorithm based on the dotted line features comprises the following steps:
(1) Preprocessing IMU data;
(2) Extracting binocular images;
(3) Processing characteristic points of the image;
(4) Processing characteristic line segments of the image;
(5) Allocation of the numbers of point and line features;
(6) Motion estimation;
(7) Data fusion.
Further, the step (1) is specifically as follows:
(11) Measurement data are obtained from the IMU; the data obtained from the six-axis IMU are the angular velocities (°/s) and the accelerations (m/s²) in the three directions x, y and z.
(12) The measured six-axis IMU data are preprocessed.
The IMU provides data at a very high frequency. By pre-integrating the IMU data, the pose-change information obtained by accumulating the high-frequency IMU data between two adjacent image frames is used to convert the continuous IMU measurements into relative motion between discrete camera frames, which reduces the number of variables involved in the optimization and thus the computational complexity.
(13) Alignment of visual and IMU time stamps is performed.
In a vision-IMU fusion system, time synchronization is a critical process to ensure that IMU data and camera frames are aligned in time. Since the data acquisition rates of IMUs and cameras are typically different, interpolation or other methods need to be used to ensure data synchronization.
The following notation is used:
t_c: timestamp of the current camera frame;
t_{i-1} and t_i: timestamps of the two IMU samples closest to t_c, where t_{i-1} < t_c < t_i;
s_{i-1} and s_i: the IMU data samples at t_{i-1} and t_i, respectively.
For a given camera frame timestamp t_c, the IMU data at that moment are obtained by linear interpolation. For any given quantity s (such as acceleration or angular velocity), its interpolated value at time t_c is expressed as
s(t_c) = s_{i-1} + (s_i - s_{i-1}) * (t_c - t_{i-1}) / (t_i - t_{i-1}).
This formula accounts for the variation between two adjacent IMU data samples and estimates the value at the camera frame timestamp t_c.
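By way of illustration, a minimal Python sketch of this interpolation step (function and variable names are illustrative and not part of the patent):

    import numpy as np

    def interpolate_imu(t_c, t_prev, t_next, s_prev, s_next):
        # Linear interpolation of an IMU quantity (acceleration or angular
        # velocity) at the camera timestamp t_c, between the two surrounding
        # IMU samples s_prev (at t_prev) and s_next (at t_next).
        alpha = (t_c - t_prev) / (t_next - t_prev)   # 0 <= alpha <= 1
        return s_prev + alpha * (s_next - s_prev)

    # Example: 200 Hz gyroscope samples around a 30 Hz camera frame.
    w_prev = np.array([0.010, -0.002, 0.150])        # at t_{i-1} = 0.000 s
    w_next = np.array([0.012, -0.001, 0.148])        # at t_i     = 0.005 s
    w_at_frame = interpolate_imu(0.003, 0.000, 0.005, w_prev, w_next)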
Further, the step (2) is specifically as follows:
the image is extracted by using a binocular camera, the image data is transmitted to an algorithm processing system through an ROS system, and the image information in the ROS format is converted into a cv:Mat format and then is processed.
Further, the step (3) is specifically as follows:
(31) To achieve scale invariance, an image pyramid is first created. This involves generating multiple scaled versions of the image, typically with each new scale being the original image scaled down by some fixed factor. In this way, even if an object changes size in the image, its features can still be detected at some scale.
An image pyramid consists of a series of progressively smaller image layers. The bottom layer is the original-size image, and each higher layer is a scaled-down version of the layer below it. Let I_0 be the original image; for each pyramid level i, the image I_i is obtained by downsampling I_{i-1}. To obtain the next layer, the current image is first convolved with a Gaussian filter and then subsampled, which helps reduce aliasing:
I_i = subsample(G * I_{i-1}),
where G is a Gaussian filter and * denotes the convolution operation. Subsampling typically keeps every other pixel of the source image, halving the size in each dimension.
The number of layers, or depth, of the pyramid depends on the application requirements. In ORB, a fixed depth, such as 8 layers, is typically chosen to ensure that keypoints are detected at various scales.
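A minimal sketch of such a pyramid (Gaussian smoothing followed by keeping every other pixel), assuming OpenCV; the kernel size and sigma are illustrative:

    import cv2

    def build_pyramid(image, levels=8, sigma=1.0):
        # Each level is the previous level blurred with a Gaussian filter (G * I)
        # and then subsampled by a factor of 2 in each dimension.
        pyramid = [image]
        for _ in range(1, levels):
            blurred = cv2.GaussianBlur(pyramid[-1], (5, 5), sigma)
            pyramid.append(blurred[::2, ::2])
        return pyramid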
(32) For each pixel P, a circular area of 16 pixels around it is considered. If, of the 16 pixels, there are n consecutive pixels whose luminance is significantly higher or significantly lower than P, P is considered to be a corner point.
(33) A pixel P(x, y) and a brightness threshold t are selected, and its brightness value I_P is calculated. Considering the circle of 16 pixels around P, pixels 1, 5, 9 and 13 are checked first; only if the brightness values of all 4 of these pixels are greater than I_P + t or all less than I_P - t are the other pixels examined further.
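A sketch of this preliminary FAST test as described above (checking only the four compass pixels of the radius-3 circle); the helper name is illustrative:

    def fast_pretest(img, x, y, t):
        # img is a grayscale image; P = (x, y); t is the brightness threshold.
        Ip = int(img[y, x])
        quad = [int(img[y - 3, x]), int(img[y, x + 3]),   # pixels 1 and 5
                int(img[y + 3, x]), int(img[y, x - 3])]   # pixels 9 and 13
        # Only if all four are brighter than I_P + t or all darker than I_P - t
        # is the full 16-pixel test worth running.
        return all(v > Ip + t for v in quad) or all(v < Ip - t for v in quad)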
(34) Corner detection by Harris
Harris corner response is based on the following idea: if the image brightness changes significantly when a window is moved in any direction, the pixel at the center of the window is likely a corner. For an image I, the brightness change caused by a small offset (dx, dy) of a window w(x, y) at position (x, y) is
E(dx, dy) = Σ_{x,y} w(x, y) [I(x + dx, y + dy) - I(x, y)]²,
which for small offsets is characterized by a matrix M built from I_x and I_y, the partial derivatives of the image in the x and y directions.
The two eigenvalues λ_1 and λ_2 of the matrix M (with λ_1 ≥ λ_2) describe the variation of the image in its two principal directions.
The Harris response is defined as:
R = det(M) - k(trace(M))²
where det(M) = λ_1·λ_2 and trace(M) = λ_1 + λ_2. The constant k typically takes a value between 0.04 and 0.06. When R is small in magnitude (whether positive or negative), λ_1 and λ_2 are both small and the region is flat. When R is positive and large, λ_1 and λ_2 are both large and close to each other, and the region is a corner. When R is negative, one eigenvalue is large and the other is small, and the region is an edge.
(35) All detected corner points are sorted in descending order of their Harris response value R, so the corner with the largest Harris response is at the front of the list. The ORB feature extractor does not simply select the n keypoints with the highest response values. To ensure that the extracted keypoints are evenly distributed over the image, ORB divides the image into multiple grid cells and, for each cell, selects the keypoints with the highest Harris response. If a cell contains too many keypoints, only the k points with the highest response are kept, where k depends on the number of keypoints one wishes to extract from each cell. If a cell contains too few or no keypoints, keypoints may be "borrowed" from neighboring cells.
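A sketch of this grid-based selection, assuming OpenCV's Harris response map as the score; the grid size and per-cell quota are illustrative:

    import cv2
    import numpy as np

    def grid_filtered_corners(gray, grid=(5, 5), k_per_cell=20, k=0.04):
        # Harris response R for every pixel, then keep at most k_per_cell
        # of the strongest responses inside each grid cell.
        R = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=k)
        h, w = gray.shape
        keep = []
        for gy in range(grid[0]):
            for gx in range(grid[1]):
                y0, y1 = gy * h // grid[0], (gy + 1) * h // grid[0]
                x0, x1 = gx * w // grid[1], (gx + 1) * w // grid[1]
                cell = R[y0:y1, x0:x1]
                idx = np.argsort(cell, axis=None)[::-1][:k_per_cell]
                ys, xs = np.unravel_index(idx, cell.shape)
                keep.extend((int(x + x0), int(y + y0)) for x, y in zip(xs, ys))
        return keep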
(36) For each keypoint, ORB computes the moments of its surrounding region. Specifically, for a keypoint p(x, y), the surrounding pixel region can be defined as a small window or circular patch. For each pixel q_i = (i, j) in this region, the intensity-weighted moments are computed:
m_{ab} = Σ_i i^a j^b I(q_i),
where I(q_i) is the brightness value of pixel q_i.
Using the moments computed above, the orientation θ_p of the keypoint can be computed as
θ_p = atan2(m_{01}, m_{10}).
once the direction is determined for each keypoint, its binary descriptor can be rotated accordingly, giving it rotational invariance. When the matched feature points are found in another image, the descriptors are rotated to the reference direction, so that the rotated key points can be correctly matched.
(37) The BRIEF descriptor generates a binary string through simple brightness comparisons of pixels around the keypoint. Specifically, for a pair of pixels p(x_p, y_p) and q(x_q, y_q), if I(p) < I(q) the corresponding bit of the descriptor is 1, otherwise it is 0.
(38) Around a given keypoint, N pairs of pixels (e.g. N = 256) are randomly selected for comparison. The selection of these pixel pairs is predefined and can be reused across images. ORB assigns an orientation to each keypoint, and for the BRIEF descriptor to be rotation invariant, the predefined pixel pairs must be rotated according to that orientation. In this way, even if the image is rotated, the same descriptor can be generated for the same keypoint after rotation. Assuming the orientation of the keypoint is θ, the transformation of a pixel p(x_p, y_p) to its new position p'(x_p', y_p') is
x_p' = x_p cos(θ) - y_p sin(θ)
y_p' = x_p sin(θ) + y_p cos(θ)
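A sketch of a steered binary descriptor built from rotated sampling pairs; the random pattern, patch size (at least 31 x 31) and 256-bit length are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    PAIRS = rng.integers(-10, 11, size=(256, 4))   # fixed (xp, yp, xq, yq) offsets

    def steered_brief(patch, theta):
        # Rotate every predefined pixel pair by the keypoint orientation theta,
        # then set bit i to 1 if I(p) < I(q), following the comparison rule above.
        c, s = np.cos(theta), np.sin(theta)
        cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
        bits = np.zeros(256, dtype=np.uint8)
        for i, (xp, yp, xq, yq) in enumerate(PAIRS):
            xpr, ypr = c * xp - s * yp, s * xp + c * yp
            xqr, yqr = c * xq - s * yq, s * xq + c * yq
            Ip = patch[cy + int(round(ypr)), cx + int(round(xpr))]
            Iq = patch[cy + int(round(yqr)), cx + int(round(xqr))]
            bits[i] = 1 if Ip < Iq else 0
        return bits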
(39) The ORB descriptors are binary, meaning that the similarity between feature points can be evaluated by computing the hamming distance of their descriptors. Hamming distance is a method of measuring the difference between two strings, and in particular, it represents the number of characters that differ in their corresponding positions between two strings of equal length. For binary descriptors of ORB, the Hamming distance represents the number of corresponding bits that differ between two descriptors.
Given two ORB descriptors D_1 and D_2, which are binary vectors of length N (typically N = 256), the Hamming distance H(D_1, D_2) is the number of bits in which D_1 and D_2 differ. Mathematically, it can be computed by taking the exclusive OR (XOR) of the two vectors and counting the number of 1s in the result:
H(D_1, D_2) = Σ_{i=1..N} D_1(i) ⊕ D_2(i),
where D_1(i) and D_2(i) are the i-th bits of the descriptors and ⊕ is the XOR operation.
For a feature point in one image, the most similar feature point in another image is found by looking for the descriptor with the smallest Hamming distance: given a descriptor D_1, a descriptor D_2 is sought among all descriptors of the other image such that H(D_1, D_2) is minimal. To filter out false matches, a threshold on the Hamming distance is set; two descriptors are considered a match only if their Hamming distance is below this threshold. To increase the robustness of matching, a ratio test is also applied. The test is based on the observation that, for a good match, there should be a clear gap between the Hamming distance to the nearest descriptor and the distance to the second-nearest one. Specifically, for descriptor D_1, besides its nearest neighbor D_2^+ the second-nearest neighbor D_2^- is also found, and the ratio
r = H(D_1, D_2^+) / H(D_1, D_2^-)
is computed. If this ratio is less than a certain threshold, the match is considered robust.
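A sketch of the Hamming-distance matching with the distance threshold and ratio test described above; the threshold values are illustrative:

    import numpy as np

    def hamming(d1, d2):
        # d1, d2: binary descriptors as arrays of 0/1 bits.
        return int(np.count_nonzero(np.bitwise_xor(d1, d2)))

    def match_with_ratio_test(desc, candidates, max_dist=64, ratio=0.8):
        # Nearest and second-nearest neighbors of `desc` among `candidates`.
        dists = [hamming(desc, c) for c in candidates]
        order = np.argsort(dists)
        best, second = int(order[0]), int(order[1])
        if dists[best] < max_dist and dists[best] < ratio * dists[second]:
            return best        # index of the accepted match
        return None            # rejected as ambiguous or too distant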
In practice, if too many feature points are extracted, the number of mismatches increases; since many feature points would otherwise be extracted from the ground, the number of extracted feature points is set to 1000. In addition, the near/far threshold is set to 30 to limit the influence of changes in viewing distance.
Further, the step (4) is specifically as follows:
(41) The image is smoothed with a Gaussian filter to reduce noise. The mathematical expression of the Gaussian filter is
G(x, y) = (1 / (2πσ²)) exp(-(x² + y²) / (2σ²)),
where x and y are the pixel coordinates and σ is the standard deviation of the Gaussian filter.
The gradient magnitude and direction of each pixel are then computed by applying the Sobel operator in the horizontal and vertical directions. The gradient magnitude G and gradient direction Θ are computed as
G = sqrt(G_x² + G_y²),  Θ = atan2(G_y, G_x),
where G_x and G_y are the gradients of the image in the x and y directions, respectively.
Each pixel in the image is traversed and non-edge pixels are eliminated, leaving only the pixel points where the gradient magnitude is locally maximum in the gradient direction.
Two thresholds are used: a low threshold and a high threshold to detect and connect edges. First, all pixels with gradient magnitudes above the high threshold are considered strong edges. Those pixels below the low threshold are excluded. The pixel point between the two thresholds determines whether the pixel point is an edge according to the connection condition of the pixel point and the strong edge.
(42) Straight lines in image space are converted to points in parameter space using the Hough transform. In image space a straight line can be written as y = mx + b, where m is the slope and b is the y-intercept, which is problematic for vertical lines. The Hough transform therefore represents a line in polar coordinates with two parameters, r and θ. The polar equation of a line is r = x cos θ + y sin θ, where r is the perpendicular distance from the origin to the line and θ is the angle between the x-axis and the perpendicular from the origin to the line. For each point in the image and a series of θ values, the corresponding r value is computed and the count in the accumulator space (typically a two-dimensional array indexed by r and θ) is incremented. Each pixel thus traces a sinusoid in parameter space. When a straight line exists in the image, the curves corresponding to all points on that line intersect at one point in parameter space, and the r and θ values of this intersection are the parameters of the line. In the practical application of the invention, a line segment is stored and represented by the coordinates of its two endpoints.
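A sketch of this edge-plus-line extraction assuming OpenCV; the probabilistic Hough variant is used here because it returns the segment endpoints directly, and all thresholds are illustrative:

    import cv2
    import numpy as np

    def extract_line_segments(gray):
        blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)        # Gaussian smoothing
        edges = cv2.Canny(blurred, 50, 150)                  # low / high hysteresis thresholds
        segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                                   threshold=80, minLineLength=30, maxLineGap=5)
        # Each segment is returned as its two endpoints (x1, y1, x2, y2).
        return [] if segments is None else [s[0] for s in segments]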
(43) The matching of line-segment features is based on the position of the segment center point and the angle of the segment, using Plücker coordinates [l, m, n, a, b, c], where [l, m, n] represents the direction of the line and [a, b, c] a point on the line. In the invention, line segments are matched according to the distance between their center points and their angle difference.
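A sketch of matching segments by center-point distance and angle difference, as described above; the thresholds and the greedy strategy are assumptions:

    import numpy as np

    def match_segments(segs_a, segs_b, max_center_dist=20.0, max_angle_diff=0.1):
        # segs_*: lists of segments given as (x1, y1, x2, y2).
        def center(s): return np.array([(s[0] + s[2]) / 2.0, (s[1] + s[3]) / 2.0])
        def angle(s):  return np.arctan2(s[3] - s[1], s[2] - s[0])
        matches = []
        for i, a in enumerate(segs_a):
            best, best_d = None, max_center_dist
            for j, b in enumerate(segs_b):
                d = np.linalg.norm(center(a) - center(b))
                da = abs(angle(a) - angle(b)) % np.pi        # line direction is defined modulo pi
                if d < best_d and min(da, np.pi - da) < max_angle_diff:
                    best, best_d = j, d
            if best is not None:
                matches.append((i, best))
        return matches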
Further, the step (5) is specifically as follows:
the texture analysis of the image is performed using the gray co-occurrence matrix, and first a specific direction and distance are selected. Then, at this direction and distance, the frequency of occurrence of each gray-level pixel pair (i, j) in the image is counted. The result is a matrix in which the number of rows and columns is equal to the number of gray levels in the image, with each element in the matrix representing the frequency or probability of the occurrence of a particular pixel pair throughout the image. If the image has N gray levels, the gray co-occurrence matrix is an N x N matrix M, where each element M (i, j) of the matrix represents the number or probability of co-occurrence of a pixel having a gray value i and a pixel having a gray value j according to a defined spatial relationship (direction and distance). Once the gray level co-occurrence matrix is calculated, a variety of texture features are extracted from it, with contrast reflecting the extent of local gray level variation in the image.
Uniformity measures the degree of uniformity of the image texture.
The correlation measures the correlation between gray levels in the image.
where μ_i and μ_j are the means of gray levels i and j, and σ_i and σ_j are their standard deviations.
Entropy measures the irregularity or complexity of image textures.
The allocation is determined from the uniformity and entropy of the gray-level co-occurrence matrix: when uniformity decreases and entropy increases, the image texture is rich, and more point features are selected to improve the computational efficiency of the system; otherwise, more line-segment features are selected to ensure accuracy.
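A sketch of the co-occurrence matrix and of the texture measures named above; standard formulas are assumed since the patent text omits them, and "uniformity" is taken here as the angular second moment (energy):

    import numpy as np

    def glcm_features(gray, levels=16, dx=1, dy=0):
        # Quantize to `levels` gray levels and count co-occurrences at offset (dx, dy).
        q = (gray.astype(np.float64) * levels / 256.0).astype(int)
        h, w = q.shape
        M = np.zeros((levels, levels))
        np.add.at(M, (q[0:h - dy, 0:w - dx], q[dy:h, dx:w]), 1.0)
        M /= M.sum()                                  # joint probability M(i, j)
        i, j = np.mgrid[0:levels, 0:levels]
        contrast    = np.sum((i - j) ** 2 * M)
        uniformity  = np.sum(M ** 2)                  # angular second moment (energy)
        mu_i, mu_j  = np.sum(i * M), np.sum(j * M)
        s_i = np.sqrt(np.sum((i - mu_i) ** 2 * M))
        s_j = np.sqrt(np.sum((j - mu_j) ** 2 * M))
        correlation = np.sum((i - mu_i) * (j - mu_j) * M) / (s_i * s_j)
        entropy     = -np.sum(M[M > 0] * np.log2(M[M > 0]))
        return contrast, uniformity, correlation, entropy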
Further, the step (6) is specifically as follows:
the binocular camera estimates the three-dimensional position of the feature through the views of the two cameras, and when two-dimensional image points corresponding to n three-dimensional space points are known in the camera, the pose and the position of the camera relative to the three-dimensional points are solved. The goal of the PnP problem is to recover the pose of the camera from a corresponding set of 3D-2D point pairs. Typically, at least 3 point pairs are needed to solve this problem, but in order to increase robustness, more point pairs are typically used. The EPnP method is used for solving, which reduces the complexity of the problem by representing spatial points as linear combinations of virtual points. Then, using a graph optimization method, taking the estimated pose of the camera and the three-dimensional position of the feature as two nodes, and taking the projection relation of the two-dimensional image point and the three-dimensional space point as the edge connecting the two-dimensional image point and the three-dimensional space point, so as to simultaneously optimize the pose of the camera and the three-dimensional space point.
Further, the step (7) is specifically as follows:
first, all models that the system may follow are determined.
For each model j in the model set, the probability p_ij of switching from model i to model j is given by the model transition probability matrix. For each model j, its model probability μ_j is updated as
μ_j ∝ Λ_j Σ_i p_ij μ_i,  normalized so that Σ_j μ_j = 1,
where Λ_j is the likelihood function of model j, which measures how well the observation matches the prediction of that model. For each model, the state is predicted with that model's state transition matrix; the Kalman filter update step then corrects the predicted state with the observation.
The state estimates of all models are mixed to obtain the overall state estimate
x̂ = Σ_j μ_j x̂_j,
where x̂_j is the state estimate of model j.
Likewise, the covariances of all models are mixed to obtain the overall covariance estimate
P = Σ_j μ_j [P_j + (x̂_j - x̂)(x̂_j - x̂)^T],
where P_j is the covariance estimate of model j.
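A sketch of one interacting-multiple-model combination step for two models, following the update and mixing formulas above; the transition matrix and model set are illustrative:

    import numpy as np

    def imm_combine(mu, P_trans, likelihoods, x_hats, P_hats):
        # mu: prior model probabilities; P_trans[i, j] = p_ij;
        # likelihoods[j]: likelihood of model j given the current observation;
        # x_hats[j], P_hats[j]: per-model state and covariance estimates.
        c = P_trans.T @ mu                       # sum_i p_ij * mu_i
        mu_new = likelihoods * c
        mu_new /= mu_new.sum()                   # normalized model probabilities
        x_total = sum(m * x for m, x in zip(mu_new, x_hats))
        P_total = np.zeros_like(P_hats[0])
        for m, x, P in zip(mu_new, x_hats, P_hats):
            d = (x - x_total).reshape(-1, 1)
            P_total += m * (P + d @ d.T)
        return mu_new, x_total, P_total

    # Illustrative two-model set (e.g. constant velocity vs. turning).
    mu = np.array([0.5, 0.5])
    P_trans = np.array([[0.95, 0.05],
                        [0.05, 0.95]])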
The beneficial effects of the invention are as follows:
the invention discloses a visual inertial integrated navigation algorithm based on dotted line characteristics, which mainly aims at complex motion or extreme flight dynamics, firstly extracts, describes and matches characteristic points and characteristic line segments of images respectively, then fuses visual information and information of an inertial odometer based on a multi-mode estimation algorithm, can improve calculation efficiency, realizes accurate position and direction tracking, has more flexible characteristic selection and has better adaptability to the environment.
Drawings
FIG. 1 is a flow chart of the overall algorithm framework;
FIG. 2 is a flow chart of vision and IMU data fusion;
FIG. 3 is a result graph of an outdoor dataset;
FIG. 4 is a result graph of an indoor dataset;
fig. 5 is a graph of the results of the actual measurement.
Detailed Description
The present invention is further illustrated in the following drawings and detailed description, which are to be understood as being merely illustrative of the invention and not limiting the scope of the invention.
As shown in the figures, the visual-inertial integrated navigation algorithm based on point-line features comprises the following steps:
pretreatment of IMU: the IMU may provide data at a very high frequency (e.g., 200Hz or higher), while the sampling frequency of the camera is typically lower (e.g., 30 Hz). If the original high frequency IMU data is used directly in the state estimation, this results in a dramatic increase in computation. By pre-integrating the IMU data, pose change information obtained by accumulating (integrating) high-frequency data of the IMU between two adjacent frames of images is utilized to convert continuous IMU measurement data into relative motion (including position, speed and pose changes) between the discrete camera frames, so that the number of variables involved in optimization is reduced, and the computational complexity is reduced.
In a vision-IMU fusion system, time synchronization is a critical process to ensure that IMU data and camera frames are aligned in time. Since the data acquisition rates of IMUs and cameras are typically different (IMUs typically have much higher data frequencies than cameras), interpolation or other methods are required to ensure data synchronization.
The following notation is used:
t_c: timestamp of the current camera frame;
t_{i-1} and t_i: timestamps of the two IMU samples closest to t_c, where t_{i-1} < t_c < t_i;
s_{i-1} and s_i: the IMU data samples at t_{i-1} and t_i, respectively.
For a given camera frame timestamp t_c, the IMU data at that moment are obtained; a common approach is linear interpolation. For any given quantity s (such as acceleration or angular velocity), its value at time t_c can be expressed as
s(t_c) = s_{i-1} + (s_i - s_{i-1}) * (t_c - t_{i-1}) / (t_i - t_{i-1}).
This formula simply accounts for the variation between two adjacent IMU data samples and estimates the value at the camera frame timestamp t_c.
Processing of feature points of an image: to achieve scale invariance, an image pyramid is first created. This involves generating multiple scale versions of the image, typically each new scale being a scaled down version of some fixed factor of the original image. In this way, even if the object changes in size in the image, its features can be detected on different scales.
An image pyramid consists of a series of progressively smaller image layers. The bottom layer is the original-size image, and each higher layer is a scaled-down version of the layer below it. Let I_0 be the original image; for each pyramid level i (starting from 1), the image I_i is obtained by downsampling I_{i-1}. To obtain the next layer, the current image is first convolved with a Gaussian filter and then subsampled, which helps reduce aliasing:
I_i = subsample(G * I_{i-1}),
where G is a Gaussian filter and * denotes the convolution operation. Subsampling typically keeps every other pixel of the source image, halving the size in each dimension.
The number of layers or depth of the pyramid depends on the application requirements. In an ORB, a fixed depth, such as 8 layers, is typically chosen to ensure that keypoints are detected at various scales.
For each pixel P, a circular area of 16 pixels around it is considered. P is considered to be a corner if of the 16 pixels there are n consecutive pixels (typically n=12) whose luminance is significantly higher than P (above a given threshold) or significantly lower than P.
A pixel P(x, y) and a brightness threshold t are selected, and its brightness value I_P is calculated (typically as the weighted average I = 0.299R + 0.587G + 0.114B). Considering the circle of 16 pixels around P, pixels 1, 5, 9 and 13 are checked first; only if the brightness values of all 4 of these pixels are greater than I_P + t or all less than I_P - t are the other pixels examined further. Because FAST first examines only these 4 pixels and considers the others only when the condition is met, it is computationally fast: most non-corner pixels are quickly excluded at this preliminary stage.
Harris corner detection is a common corner detection method that determines corners based on local variations of images. In ORB, although FAST is used to detect corner points, harris corner responses are used to determine the intensity of these corner points and rank and filter based on these response values.
Harris corner response is based on the following idea: if the image brightness changes significantly when a window is moved in any direction, the pixel at the center of the window is likely a corner. For an image I, the brightness change caused by a small offset (dx, dy) of a window w(x, y) at position (x, y) is
E(dx, dy) = Σ_{x,y} w(x, y) [I(x + dx, y + dy) - I(x, y)]²,
which for small offsets is characterized by a matrix M built from I_x and I_y, the partial derivatives of the image in the x and y directions.
The two eigenvalues λ_1 and λ_2 of the matrix M (with λ_1 ≥ λ_2) describe the variation of the image in its two principal directions. The Harris response is defined as:
R = det(M) - k(trace(M))²
where det(M) = λ_1·λ_2 and trace(M) = λ_1 + λ_2. The constant k typically takes a value between 0.04 and 0.06.
When R is small in magnitude (whether positive or negative), λ_1 and λ_2 are both small and the region is flat. When R is positive and large, λ_1 and λ_2 are both large and close to each other, and the region is a corner. When R is negative, one eigenvalue is large and the other is small, and the region is an edge.
All detected corner points are sorted in descending order of their Harris response value R, so the corner with the largest Harris response is at the front of the list. The ORB feature extractor does not simply select the n keypoints with the highest response values. To ensure that the extracted keypoints are evenly distributed over the image, ORB divides the image into multiple grid cells, such as a 5 x 5 grid, and for each cell selects the keypoints with the highest Harris response. If a cell contains too many keypoints, only the k points with the highest response are kept, where k depends on the number of keypoints one wishes to extract from each cell. If a cell contains too few or no keypoints, keypoints may be "borrowed" from neighboring cells.
For each keypoint, ORB computes the moments of its surrounding region. Specifically, for a keypoint p(x, y), the surrounding pixel region can be defined as a small window or circular patch. For each pixel q_i = (i, j) in this region, the intensity-weighted moments are computed:
m_{ab} = Σ_i i^a j^b I(q_i),
where I(q_i) is the brightness value of pixel q_i.
Using the moments computed above, the orientation θ_p of the keypoint can be computed as
θ_p = atan2(m_{01}, m_{10}).
once the direction is determined for each keypoint, its binary descriptor can be rotated accordingly, giving it rotational invariance. When the matched feature points are found in another image, the descriptors are rotated to the reference direction, so that the rotated key points can be correctly matched.
The BRIEF descriptor generates a binary string through simple brightness comparisons of pixels around the keypoint. Specifically, for a pair of pixels p(x_p, y_p) and q(x_q, y_q), if I(p) < I(q) the corresponding bit of the descriptor is 1, otherwise it is 0.
Around a given keypoint, N pairs of pixels (e.g. N = 256) are randomly selected for comparison. The selection of these pixel pairs is predefined and can be reused across images. ORB assigns an orientation to each keypoint, and for the BRIEF descriptor to be rotation invariant, the predefined pixel pairs must be rotated according to that orientation. In this way, even if the image is rotated, the same descriptor can be generated for the same keypoint after rotation. Assuming the orientation of the keypoint is θ, the transformation of a pixel p(x_p, y_p) to its new position p'(x_p', y_p') is
x_p' = x_p cos(θ) - y_p sin(θ)
y_p' = x_p sin(θ) + y_p cos(θ)
the ORB descriptors are binary, meaning that the similarity between feature points can be evaluated by computing the hamming distance of their descriptors. Hamming distance is a method of measuring the difference between two strings, and in particular, it represents the number of characters that differ in their corresponding positions between two strings of equal length. For binary descriptors of ORB, the Hamming distance represents the number of corresponding bits that differ between two descriptors.
Given two ORB descriptors D_1 and D_2, which are binary vectors of length N (typically N = 256), the Hamming distance H(D_1, D_2) is the number of bits in which D_1 and D_2 differ. Mathematically, it can be computed by taking the exclusive OR (XOR) of the two vectors and counting the number of 1s in the result:
H(D_1, D_2) = Σ_{i=1..N} D_1(i) ⊕ D_2(i),
where D_1(i) and D_2(i) are the i-th bits of the descriptors and ⊕ is the XOR operation.
For a feature point in one image, the most similar feature point in the other image is found by looking for the descriptor with the smallest Hamming distance: given a descriptor D_1, a descriptor D_2 is sought among all descriptors of the other image such that H(D_1, D_2) is minimal. To filter out false matches, a threshold on the Hamming distance is set; two descriptors are considered a match only if their Hamming distance is below this threshold.
To increase the robustness of matching, a ratio test is also applied. The test is based on the observation that, for a good match, there should be a clear gap between the Hamming distance to the nearest descriptor and the distance to the second-nearest one. Specifically, for descriptor D_1, besides its nearest neighbor D_2^+ the second-nearest neighbor D_2^- is also found, and the ratio
r = H(D_1, D_2^+) / H(D_1, D_2^-)
is computed. If this ratio is less than a certain threshold (e.g. 0.8), the match is considered robust. This helps filter out false matches caused by noise or other interference.
Processing of image characteristic line segments:
the image is smoothed using a gaussian filter, reducing noise in the image. The mathematical expression of the gaussian filter is:
/>
where x and y are pixel locations and σ is the standard deviation of the gaussian filter.
The gradient amplitude and direction of each pixel point are calculated, and the Sobel operator is applied in the horizontal and vertical directions to finish the calculation. The calculation formulas of the gradient amplitude G and the gradient direction Θ are as follows:
Wherein G is x And G y The gradients of the image in the x and y directions, respectively.
Each pixel in the image is traversed and non-edge pixels are eliminated, leaving only the pixel points where the gradient magnitude is locally maximum in the gradient direction.
Two thresholds are used: a low threshold and a high threshold to detect and connect edges. First, all pixels with gradient magnitudes above the high threshold are considered strong edges. Those pixels below the low threshold are excluded. The pixel point between the two thresholds determines whether the pixel point is an edge according to the connection condition of the pixel point and the strong edge.
Straight lines in image space are converted to points in parameter space using the Hough transform. In image space a straight line can be written as y = mx + b, where m is the slope and b is the y-intercept, which is problematic for vertical lines (infinite slope). The Hough transform therefore represents a line in polar coordinates with two parameters, r and θ. The polar equation of a line is r = x cos θ + y sin θ, where r is the perpendicular distance from the origin to the line and θ is the angle between the x-axis and the perpendicular from the origin to the line.
For each point in the image and a series of θ values, a corresponding r value is calculated and a count is incremented in the accumulator space (typically a two-dimensional array, corresponding to different r and θ values). Each pixel point (assuming edge points) forms a sinusoid in the parameter space. When a straight line exists in the image, curves corresponding to all points on the straight line in the parameter space are intersected at a certain point, and r and theta values corresponding to the intersection point represent parameters of the straight line.
Suppose two sets of line segments are given, each from a different image or view. To find matches between the two sets, a feature vector is first defined for each segment; the feature vector may include the segment's length, direction and position. The number of RANSAC iterations and the distance threshold are set. In each iteration, one segment is randomly selected from the first set as a sample, and the segment in the second set whose feature vector is closest to the sample's (by Euclidean distance) is found. For each segment in the second set, if the distance between its feature vector and that of the selected sample is less than the threshold, the segment is counted as an inlier. If the number of inliers in the current iteration exceeds that of any previous iteration, the current sample and all its inliers are recorded. After all iterations, the sample and its matching set with the most inliers are taken as the final matching result.
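A simplified sketch of this RANSAC-style segment matching; the feature-vector layout and thresholds are assumptions for illustration:

    import numpy as np

    def ransac_segment_matching(feats_a, feats_b, iters=100, dist_thresh=5.0):
        # feats_*: lists of per-segment feature vectors (numpy arrays),
        # e.g. [length, angle, center_x, center_y].
        rng = np.random.default_rng(0)
        best_sample, best_inliers = None, []
        for _ in range(iters):
            k = int(rng.integers(len(feats_a)))          # random sample from the first set
            sample = feats_a[k]
            inliers = [j for j in range(len(feats_b))
                       if np.linalg.norm(sample - feats_b[j]) < dist_thresh]
            if len(inliers) > len(best_inliers):         # keep the iteration with most inliers
                best_sample, best_inliers = k, inliers
        return best_sample, best_inliers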
Allocation of the numbers of point and line features:
the texture analysis of the image is performed using the gray co-occurrence matrix, and first a specific direction and distance are selected. For example, the direction may be 0 degrees (horizontal), 45 degrees, 90 degrees (vertical), or 135 degrees, and the distance may be adjacent pixels (typically 1). Then, at this direction and distance, the frequency of occurrence of each gray-level pixel pair (i, j) in the image is counted. The result is a matrix in which the number of rows and columns is equal to the number of gray levels in the image, with each element in the matrix representing the frequency or probability of the occurrence of a particular pixel pair throughout the image. If the image has N gray levels, the gray co-occurrence matrix is an N x N matrix M, where each element M (i, j) of the matrix represents the number or probability of co-occurrence of a pixel having a gray value i and a pixel having a gray value j according to a defined spatial relationship (direction and distance). Once the gray level co-occurrence matrix is calculated, various texture features can be extracted from it, such as the degree to which contrast reflects local gray level variations in the image.
Uniformity measures the degree of uniformity of the image texture.
The correlation measures the correlation between gray levels in the image.
where μ_i and μ_j are the means of gray levels i and j, and σ_i and σ_j are their standard deviations.
Entropy measures the irregularity or complexity of image textures.
The allocation is determined from the uniformity and entropy of the gray-level co-occurrence matrix: when uniformity decreases and entropy increases, the image texture is rich, and more point features are selected to improve the computational efficiency of the system; otherwise, more line-segment features are selected to ensure accuracy.
Motion estimation: the binocular camera can estimate the three-dimensional position of a feature from the views of its two cameras, and when the two-dimensional image points corresponding to n three-dimensional space points are known, the pose (rotation) and position (translation) of the camera relative to these three-dimensional points are solved. The goal of the PnP problem is to recover the camera pose from a set of corresponding 3D-2D point pairs. In principle at least 3 point pairs are needed (P3P), but more are usually used to increase robustness. The EPnP method is used for solving; it reduces the complexity of the problem by representing the spatial points as linear combinations of virtual control points. Then, using a graph optimization method, the estimated camera pose and the three-dimensional feature positions are taken as two kinds of nodes, and the projection relations between the two-dimensional image points and the three-dimensional space points are taken as the edges connecting them, so that the camera pose and the three-dimensional points are optimized simultaneously.
Data fusion: an interactive multimodal algorithm is a method of estimating the state of a nonlinear, multimodal system. The method is mainly applied to the field of target tracking and other related fields, particularly when a motion model of a target can be switched between several different models. IMM estimates the state by weighted combining of all considered models.
First, all models that the system may follow are determined. For example, one target may follow a "uniform linear motion" model or a "turn" model. For each model there is a corresponding state transition matrix and measurement matrix.
For each model j in the model set, the probability p_ij of switching from model i to model j is given by the model transition probability matrix.
For each model j, its model probability μ_j is updated as
μ_j ∝ Λ_j Σ_i p_ij μ_i,  normalized so that Σ_j μ_j = 1,
where Λ_j is the likelihood function of model j, which measures how well the observation matches the prediction of that model. For each model, the state is predicted with that model's state transition matrix; the Kalman filter update step then corrects the predicted state with the observation.
The state estimates of all models are mixed to obtain the overall state estimate
x̂ = Σ_j μ_j x̂_j,
where x̂_j is the state estimate of model j.
Likewise, the covariances of all models are mixed to obtain the overall covariance estimate
P = Σ_j μ_j [P_j + (x̂_j - x̂)(x̂_j - x̂)^T],
where P_j is the covariance estimate of model j.
Several points are noted:
1. The intrinsic parameters of the binocular camera are known; distortion correction, noise removal and smoothing of the image are performed first. Then the texture richness of the image is computed, the numbers of feature points and feature line segments to extract are allocated according to this result, and the feature points and line segments of the image are extracted, described and matched. From the matched features of the left and right cameras, the three-dimensional coordinates of the features are obtained by triangulation and stored in a local map. The features obtained from each new image frame are then matched against the three-dimensional features of the local map, the position and attitude of the camera are computed with the EPnP solution of the PnP problem and used as initial values, and these initial values are then used together with the three-dimensional features of the local map to perform the optimization with a graph optimization method. Every time a new image frame is received, the local map is updated: newly extracted three-dimensional features are added and three-dimensional features that are never re-detected are removed. Throughout the process a global map is also maintained, and loop-closure detection is used to eliminate accumulated error. Finally, the binocular vision and IMU data are fused with the interacting multiple model algorithm, where the state values are the poses estimated by the camera and the IMU, and the observations are the three-dimensional features stored in the local map.
2. Several points deserve attention in the whole process. First, more extracted features are not necessarily better: in actual measurements the invention found that when too many features are extracted, a large fraction of them lie on the ground, which increases the number of mismatches and finally makes the computed displacement much smaller than the real one. Second, features are most easily lost during light-dark transitions and near-far transitions, and if the parameters are set improperly the system easily loses track. For these two cases, the invention reduces the number of extracted features (set to 1000 in the invention) and at the same time increases the near/far recognition threshold (set to 80 in the invention), so that motion can be estimated better.
3. In order to solve this problem, the invention made reference to ORB-SLAM3 and VINS-Fusion, ran them directly, and then analyzed the data measured by the sensors. It was found that some IMU measurements had large deviations; for example, the Z-axis acceleration data typically deviated by 100%, and by up to 300% at worst, which for the tests of the invention is indeed related to the carrier itself (the carrier obviously bumped up and down throughout the test). Therefore, since visual navigation itself only needs its heading angle and displacement estimates to be corrected, the invention uses only the data of the IMU x- and y-axis accelerometers and of the Z-axis gyroscope, and during data fusion the trust placed in this observation is enlarged only when a monotonic change of the heading angle is detected (this approach is only suitable for ground vehicles).
It should be noted that the foregoing merely illustrates the technical idea of the present invention and is not intended to limit the scope of the present invention, and that a person skilled in the art may make several improvements and modifications without departing from the principles of the present invention, which fall within the scope of the claims of the present invention.

Claims (8)

1. A visual-inertial integrated navigation algorithm based on point-line features, characterized in that it comprises the following steps:
(1) Pretreatment of the IMU;
(2) Extracting an image;
(3) Processing characteristic points of the image;
(4) Processing characteristic line segments of the image;
(5) Allocation of the numbers of point and line features;
(6) Motion estimation;
(7) Data fusion.
2. The visual-inertial integrated navigation algorithm based on point-line features according to claim 1, characterized in that step (1) is specifically as follows:
(11) Measurement data are obtained from the IMU; the data obtained from the six-axis IMU are the angular velocities (°/s) and the accelerations (m/s²) in the three directions x, y and z;
(12) Preprocessing the measured six-axis IMU data;
the IMU provides data at a very high frequency, and the pose change information obtained by accumulating the high-frequency data of the IMU between two adjacent frames of images is utilized to convert continuous IMU measurement data into relative motion between the discrete camera frames by pre-integrating the IMU data, so that the variable quantity participating in optimization is reduced, and the calculation complexity is reduced;
(13) Performing visual and IMU time stamp alignment;
in a vision-IMU fusion system, time synchronization is a critical process to ensure that IMU data and camera frames are aligned in time; because the IMU and camera data acquisition rates are typically different, interpolation or other methods need to be used to ensure data synchronization;
the following notation is used:
t_c: timestamp of the current camera frame;
t_{i-1} and t_i: timestamps of the two IMU samples closest to t_c, where t_{i-1} < t_c < t_i;
s_{i-1} and s_i: the IMU data samples at t_{i-1} and t_i, respectively;
for a given camera frame timestamp t_c, the IMU data at that moment are obtained using linear interpolation; for any given quantity s (such as acceleration or angular velocity), its value at time t_c is expressed as:
s(t_c) = s_{i-1} + (s_i - s_{i-1}) * (t_c - t_{i-1}) / (t_i - t_{i-1}),
this formula takes into account the variation between two adjacent IMU data samples and estimates the value at the camera frame timestamp t_c.
3. The visual-inertial integrated navigation algorithm based on point-line features according to claim 1, characterized in that step (2) is specifically as follows:
the image is extracted by using a binocular camera, the image data is transmitted to an algorithm processing system through an ROS system, and the image information in the ROS format is converted into a cv:Mat format and then is processed.
4. The visual inertial integrated navigation algorithm based on dotted line features of claim 1, wherein: the step (3) is specifically as follows:
(31) To achieve scale invariance, an image pyramid is first created; this involves generating multiple scale versions of the image, typically each new scale being a scaled down version of some fixed factor of the original image; thus, even if the object changes in size in the image, its features can be detected on different scales;
the image pyramid consists of a series of image layers that shrink progressively; the bottom layer is the original-size image, and each higher layer is a scaled-down version of the layer below it; let I_0 be the original image; for each pyramid level i, the image I_i is obtained by downsampling the image I_{i-1}; to obtain the next layer, the current image is first convolved with a Gaussian filter and then subsampled, which helps to reduce aliasing effects:
Ĩ_{i-1} = G * I_{i-1}
where G is a Gaussian filter and * denotes the convolution operation; subsampling is then performed, typically by taking every other pixel, which halves each dimension:
I_i(x, y) = Ĩ_{i-1}(2x, 2y)
the number of layers (depth) of the pyramid depends on the application requirements; in ORB a fixed depth, such as 8 layers, is typically chosen to ensure that keypoints are detected at all scales;
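A minimal sketch of step (31), assuming OpenCV is available; cv2.pyrDown performs exactly the Gaussian blur followed by factor-2 subsampling described above (the 8-level depth is the typical ORB-style choice, not a prescribed value):

```python
import cv2

def build_pyramid(img, levels=8):
    # Each higher level is the previous level blurred with a Gaussian
    # filter and subsampled so that every dimension is halved.
    pyramid = [img]
    for _ in range(1, levels):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid
```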
(32) For each pixel P, a circular area of 16 pixels around it is considered; if there are n consecutive pixels of the 16 pixels whose luminance is significantly higher than P or significantly lower than P, then P is considered to be a corner point;
(33) Select a pixel P(x, y) and a brightness threshold t; compute its brightness value I_P; considering the circle of 16 pixels around P, first check pixels 1, 5, 9 and 13; only if the brightness values of all 4 of these pixels are greater than I_P + t or all less than I_P − t are the remaining pixels checked further;
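For illustration only, the FAST segment test of steps (32)–(33) is available directly in OpenCV; a brief sketch with an assumed threshold and input file name:

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)        # input image assumed
fast = cv2.FastFeatureDetector_create(threshold=20,          # brightness threshold t
                                      nonmaxSuppression=True)
keypoints = fast.detect(gray, None)
```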
(34) Corner detection by Harris
Harris corner response is based on one concept: if the brightness of the image changes significantly when the window moves in all directions, the pixel in the center of the window may be a corner point; for image I, consider the brightness change due to small offset (dx, dy) of window w (x, y) at the (x, y) position, as shown in the following equation:
E(dx, dy) = Σ_{x,y} w(x, y) · [I(x+dx, y+dy) − I(x, y)]²
using a first-order approximation, this brightness change is summarised by the structure matrix
M = Σ_{x,y} w(x, y) · [[I_x², I_x·I_y], [I_x·I_y, I_y²]]
where I_x and I_y are the derivatives of the image in the x and y directions;
the two eigenvalues λ_1 and λ_2 of the matrix M (with λ_1 ≥ λ_2) describe the changes of the image along its two principal directions;
the Harris response is defined as:
R = det(M) − k·(trace(M))²
where det(M) = λ_1·λ_2 and trace(M) = λ_1 + λ_2; the constant k generally takes a value between 0.04 and 0.06; when the R value is small (whether positive or negative), both λ_1 and λ_2 are small and the region is flat; when R is large and positive, λ_1 and λ_2 are both large and similar and the pixel is a corner point; when R is negative, one eigenvalue is large and the other small, and the region is an edge;
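A short sketch of the Harris response of step (34), assuming OpenCV and NumPy; the window size and k are illustrative choices within the stated 0.04–0.06 range:

```python
import cv2
import numpy as np

def harris_response(gray, k=0.04, win=5):
    # Image derivatives I_x, I_y via the Sobel operator.
    Ix = cv2.Sobel(gray.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(gray.astype(np.float32), cv2.CV_32F, 0, 1, ksize=3)
    # Window-averaged products of derivatives form the structure matrix M.
    Ixx = cv2.boxFilter(Ix * Ix, -1, (win, win))
    Iyy = cv2.boxFilter(Iy * Iy, -1, (win, win))
    Ixy = cv2.boxFilter(Ix * Iy, -1, (win, win))
    det = Ixx * Iyy - Ixy * Ixy          # det(M) = lambda1 * lambda2
    trace = Ixx + Iyy                    # trace(M) = lambda1 + lambda2
    return det - k * trace * trace       # R = det(M) - k * trace(M)^2
```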
(35) All detected corner points are sorted in descending order of their Harris response value R, so that the corner with the largest Harris response is at the front of the list; the ORB feature extractor does not simply keep the n keypoints with the highest response values; to ensure that the extracted keypoints are evenly distributed over the image, ORB divides the image into a number of grid cells and, for each cell, selects the keypoints with the highest Harris response from that cell; if a cell contains too many keypoints, only the k points with the highest response are kept, where k depends on the number of keypoints one wishes to extract from each cell; if a cell contains too few or no keypoints, keypoints may be "borrowed" from neighbouring cells;
(36) For each keypoint, ORB computes the moments of its surrounding area; specifically, for a keypoint p(x, y), the surrounding pixel region is defined as a small window or circular patch, and the brightness-weighted moments of this patch (with coordinates taken relative to the keypoint) are computed as:
m_pq = Σ_{x,y} x^p · y^q · I(x, y)
where I(x, y) is the brightness value of the pixel at (x, y) inside the patch;
using the moments computed above, the orientation θ_p of the keypoint is calculated as:
θ_p = atan2(m_01, m_10)
once the orientation is determined for each keypoint, its binary descriptor is rotated accordingly, making it rotation invariant; when matching feature points are searched for in another image, the descriptors are rotated to the reference orientation so that rotated keypoints can be matched correctly;
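A minimal sketch of the intensity-centroid orientation of step (36), assuming a square grayscale patch centred on the keypoint (patch extraction itself is omitted):

```python
import numpy as np

def keypoint_orientation(patch):
    # Brightness-weighted moments m10 and m01 of the patch, with pixel
    # coordinates taken relative to the patch centre.
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    xs -= (w - 1) / 2.0
    ys -= (h - 1) / 2.0
    m10 = np.sum(xs * patch)
    m01 = np.sum(ys * patch)
    return np.arctan2(m01, m10)          # theta_p = atan2(m01, m10)
```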
(37) The BRIEF descriptor generates a binary string by simple brightness comparisons of pixels around the keypoint; specifically, for a pair of pixel points p(x_p, y_p) and q(x_q, y_q), if I(p) < I(q) the corresponding bit of the descriptor is 1, otherwise it is 0;
(38) Around a fixed keypoint, N pairs of pixels (e.g. N = 256) are selected at random for comparison; the selection of these pixel pairs is predefined and can be reused across images; ORB assigns an orientation to each keypoint, and for the BRIEF descriptor to be rotation invariant the predefined pixel pairs must be rotated according to that orientation; thus, even if the image rotates, the same descriptor is generated for the same keypoint after rotation; assuming the keypoint orientation is θ, the pixel point p(x_p, y_p) is rotated to a new position p′(x_p′, y_p′) given by:
x_p′ = x_p·cos(θ) − y_p·sin(θ)
y_p′ = x_p·sin(θ) + y_p·cos(θ)
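For step (38), a brief sketch of rotating the predefined sampling pairs by the keypoint orientation θ; the (N, 4) layout of the pair table is an assumption made for illustration:

```python
import numpy as np

def rotate_pairs(pairs, theta):
    # pairs: (N, 4) array of offsets (x_p, y_p, x_q, y_q) around the keypoint.
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s],
                  [s,  c]])
    p = pairs[:, :2] @ R.T               # x' = x cos(theta) - y sin(theta)
    q = pairs[:, 2:] @ R.T               # y' = x sin(theta) + y cos(theta)
    return np.hstack([p, q])
```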
(39) The ORB descriptors are binary, which means that the similarity between feature points can be evaluated by computing the Hamming distance between their descriptors; the Hamming distance measures the difference between two strings of equal length as the number of positions at which the corresponding characters differ; for the binary descriptors of ORB, the Hamming distance is the number of corresponding bits that differ between the two descriptors;
given two ORB descriptors D_1 and D_2, which are binary vectors of length N (N is typically 256), the Hamming distance H(D_1, D_2) is the number of bits in which D_1 and D_2 differ; mathematically, it can be computed with an exclusive-or operation (XOR) followed by counting the number of 1s in the result:
H(D_1, D_2) = Σ_{i=1..N} D_1(i) ⊕ D_2(i)
where D_1(i) and D_2(i) are the i-th bits of the descriptors and ⊕ is the XOR operation;
for a feature point in one image, the most similar feature point in the other image is found by searching for the descriptor with the smallest Hamming distance; specifically, given a descriptor D_1, a descriptor D_2 is sought among all descriptors of the other image such that H(D_1, D_2) is minimal; a Hamming-distance threshold is set to filter out false matches, and two descriptors are considered a match only if their Hamming distance is below this threshold; to improve the robustness of the matching, a ratio test is applied; this test is based on the observation that, for a good match, there should be a clear gap between the Hamming distance to the nearest descriptor and that to the second-nearest descriptor; specifically, for descriptor D_1, besides its nearest neighbour D_2⁺ its second-nearest neighbour D_2⁻ is also found, and the following ratio is computed:
ratio = H(D_1, D_2⁺) / H(D_1, D_2⁻)
if this ratio is below a certain threshold, the match is considered reliable;
based on practical experience, extracting too many feature points increases the number of mismatches, and because a large number of feature points would be extracted on the ground, the number of extracted feature points is set to 1000; in addition, the near/far threshold is set to 30 to limit the influence of scale change when the scene moves closer or farther.
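An illustrative sketch of the Hamming-distance matching and ratio test of step (39); the 0.7 ratio threshold is an assumed value, not one fixed by the invention:

```python
import cv2

def match_orb(desc1, desc2, ratio=0.7):
    # Brute-force matching of binary descriptors by Hamming distance,
    # keeping the two nearest neighbours for the ratio test.
    bf = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in bf.knnMatch(desc1, desc2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return good
```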
5. The visual inertial integrated navigation algorithm based on dotted line features of claim 1, wherein: the step (4) is specifically as follows:
(41) The image is smoothed with a Gaussian filter to reduce noise; the mathematical expression of the Gaussian filter is:
G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))
where x and y are the pixel coordinates and σ is the standard deviation of the Gaussian filter;
the gradient magnitude and direction of each pixel are then computed by applying the Sobel operator in the horizontal and vertical directions; the gradient magnitude G and gradient direction Θ are calculated as:
G = √(G_x² + G_y²)
Θ = arctan(G_y / G_x)
where G_x and G_y are the gradients of the image in the x and y directions respectively;
each pixel in the image is traversed and non-edge pixels are suppressed, keeping only those pixels whose gradient magnitude is a local maximum along the gradient direction;
two thresholds, a low threshold and a high threshold, are used to detect and connect edges; first, all pixels whose gradient magnitude exceeds the high threshold are taken as strong edges, and pixels below the low threshold are discarded; pixels between the two thresholds are kept as edges only if they are connected to a strong edge;
(42) Lines in image space are converted to points in a parameter space using the Hough transform; in image space a straight line can be written as y = m·x + b, where m is the slope and b is the y-intercept, but this representation is problematic for vertical lines; the Hough transform therefore represents a straight line in polar form with two parameters r and θ; the polar equation of the line is r = x·cosθ + y·sinθ, where r is the perpendicular distance from the line to the origin and θ is the angle that the line's normal makes with the x axis; for each edge point in the image and a series of θ values, the corresponding r value is computed and the count in the accumulator space is incremented, so each pixel traces a sinusoidal curve in the parameter space; when a straight line exists in the image, the curves of all points on that line intersect at a common point in the parameter space, and the r and θ values of that intersection are the parameters of the line; in the practical application of the invention, each line segment is represented by storing the coordinates of its two endpoints;
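A minimal sketch of steps (41)–(42) with OpenCV; all thresholds are illustrative and would be tuned on real data:

```python
import cv2
import numpy as np

def detect_segments(gray):
    # Gaussian smoothing, Canny edges, then the probabilistic Hough
    # transform, which returns each segment by its two endpoints.
    blurred = cv2.GaussianBlur(gray, (5, 5), 1.5)
    edges = cv2.Canny(blurred, 50, 150)            # low/high hysteresis thresholds
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                            threshold=80, minLineLength=30, maxLineGap=5)
    return np.empty((0, 4)) if lines is None else lines.reshape(-1, 4)
```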
(43) The matching of line-segment features is based on the position of the segment's centre point and on the segment's angle; Plücker coordinates [l, m, n, a, b, c] are used here, where [l, m, n] represents the direction of the line segment and [a, b, c] represents a point on the line segment; the invention matches line segments according to the distance between their centre points and their angle difference.
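A hedged sketch of the segment matching of step (43), using only the midpoint distance and angle difference; the thresholds and the greedy nearest-midpoint strategy are illustrative assumptions:

```python
import numpy as np

def match_segments(segs1, segs2, max_dist=20.0, max_angle=np.deg2rad(10.0)):
    # Each segment is (x1, y1, x2, y2); match by centre-point distance and
    # angle difference (taken modulo pi, since segments are undirected).
    def centre_angle(s):
        x1, y1, x2, y2 = s
        return (np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0]),
                np.arctan2(y2 - y1, x2 - x1))
    matches = []
    for i, a in enumerate(segs1):
        ca, aa = centre_angle(a)
        best_j, best_d = -1, np.inf
        for j, b in enumerate(segs2):
            cb, ab = centre_angle(b)
            d = np.linalg.norm(ca - cb)
            dang = abs((aa - ab + np.pi / 2) % np.pi - np.pi / 2)
            if d < max_dist and dang < max_angle and d < best_d:
                best_j, best_d = j, d
        if best_j >= 0:
            matches.append((i, best_j))
    return matches
```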
6. The visual inertial integrated navigation algorithm based on dotted line features of claim 1, wherein: the step (5) is specifically as follows:
using gray level co-occurrence matrix to analyze the texture of the image, firstly selecting a specific direction and distance; then, at this direction and distance, counting the frequency of occurrence of each gray-level pixel pair (i, j) in the image; the result is a matrix in which the number of rows and columns is equal to the number of gray levels in the image, each element in the matrix representing the frequency or probability of the occurrence of a particular pixel pair throughout the image; if the image has N gray levels, the gray level co-occurrence matrix is an N x N matrix M, wherein each element M (i, j) of the matrix represents the number or probability that a pixel with a gray value of i and a pixel with a gray value of j co-occur according to a defined spatial relationship; once the gray level co-occurrence matrix is calculated, a variety of texture features are extracted from it, with contrast reflecting the extent of local gray level variation in the image:
Contrast = Σ_{i,j} (i − j)² · M(i, j)
Uniformity measures the degree of uniformity of the image texture:
Uniformity = Σ_{i,j} M(i, j)²
the correlation measures the correlation between gray levels in the image:
Correlation = Σ_{i,j} (i − μ_i)(j − μ_j)·M(i, j) / (σ_i·σ_j)
where μ_i and μ_j are the means of the gray levels i and j, and σ_i and σ_j are their standard deviations;
entropy measures the irregularity or complexity of image textures:
Entropy = −Σ_{i,j} M(i, j)·log(M(i, j))
the allocation is determined from the uniformity and the entropy of the gray-level co-occurrence matrix: when the uniformity decreases and the entropy increases, the image texture is rich, so as many point features as possible are selected to improve the computational efficiency of the system; otherwise, more line-segment features are selected to guarantee accuracy.
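A sketch of this texture scoring, assuming scikit-image is available (the functions are named greycomatrix/greycoprops in older versions); the quantisation to 32 levels and the single (distance = 1, angle = 0) configuration are illustrative choices:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_scores(gray, levels=32):
    # Quantise the 8-bit image to `levels` gray levels, build the normalised
    # gray-level co-occurrence matrix, then compute uniformity and entropy.
    q = (gray.astype(np.float64) / 256.0 * levels).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0], levels=levels,
                        symmetric=True, normed=True)
    uniformity = graycoprops(glcm, "ASM")[0, 0]      # sum of squared entries
    p = glcm[:, :, 0, 0]
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return uniformity, entropy

# Rich texture (low uniformity, high entropy) -> allocate more point features;
# otherwise allocate more line-segment features.
```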
7. The visual inertial integrated navigation algorithm based on dotted line features of claim 1, wherein: the step (6) is specifically as follows:
the binocular camera estimates the three-dimensional position of the feature through the views of the two cameras, and when two-dimensional image points corresponding to n three-dimensional space points are known in the camera, the pose and the position of the camera relative to the three-dimensional points are solved; the goal of the PnP problem is to recover the pose of the camera from a set of corresponding 3D-2D point pairs; normally, at least 3 point pairs are needed to solve this problem, but in order to increase robustness, more point pairs are typically used; solving using the EPnP method, which reduces the complexity of the problem by representing spatial points as linear combinations of virtual points; then, using a graph optimization method, taking the estimated pose of the camera and the three-dimensional position of the feature as two nodes, and taking the projection relation of the two-dimensional image point and the three-dimensional space point as the edge connecting the two-dimensional image point and the three-dimensional space point, so as to simultaneously optimize the pose of the camera and the three-dimensional space point.
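A hedged sketch of step (6): EPnP via OpenCV's solvePnP on 3D–2D correspondences; the intrinsics and point arrays below are placeholders, and the subsequent graph optimization is only indicated by a comment:

```python
import cv2
import numpy as np

# Placeholder data: triangulated 3D landmarks and their 2D detections.
object_pts = np.random.rand(20, 3).astype(np.float32)
image_pts = np.random.rand(20, 2).astype(np.float32)
K = np.array([[458.0, 0.0, 367.0],
              [0.0, 457.0, 248.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)   # camera intrinsics (assumed)
dist = np.zeros(5, dtype=np.float32)                # no lens distortion assumed

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)   # camera rotation; (R, tvec) is the estimated pose
# A graph-optimization back end would then jointly refine the camera poses and
# the 3D points, using the 2D-3D projection relations as edges.
```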
8. The visual inertial integrated navigation algorithm based on dotted line features of claim 1, wherein: the step (7) is specifically as follows:
first, all models that the system may follow are determined;
for each pair of models, the probability p_ij of switching from model i to model j is specified; these probabilities form the model transition probability matrix; for each model j, its model probability μ_j is updated as:
μ_j = Λ_j · Σ_i p_ij·μ_i / c,  with the normalization constant c = Σ_j Λ_j · Σ_i p_ij·μ_i
where Λ_j is the likelihood function of model j, which measures how well the observation matches the prediction of that model; for each model, the state is predicted using the state transition matrix of that model, and the update step of a Kalman filter then corrects the predicted state with the observation;
the state estimates of all models are mixed to obtain the overall state estimate:
x̂ = Σ_j μ_j·x̂_j
where x̂_j is the state estimate based on model j;
likewise, the covariances of all models are mixed to obtain the overall covariance estimate:
P = Σ_j μ_j·[P_j + (x̂_j − x̂)(x̂_j − x̂)^T]
where P_j is the covariance estimate based on model j.
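A minimal sketch of the model-probability update and mixing of step (7), in the style of an interacting-multiple-model combination; it assumes each model already has its own Kalman filter providing a likelihood, state estimate and covariance:

```python
import numpy as np

def imm_combine(mu_prev, P_trans, likelihoods, states, covs):
    # mu_prev:     (M,) prior model probabilities mu_i
    # P_trans:     (M, M) transition matrix, P_trans[i, j] = p_ij
    # likelihoods: (M,) measurement likelihoods Lambda_j of each model's filter
    # states/covs: per-model state estimates x_j and covariances P_j
    c_j = P_trans.T @ mu_prev                   # predicted model probabilities
    mu = likelihoods * c_j
    mu /= mu.sum()                              # updated model probabilities mu_j
    x = sum(m * s for m, s in zip(mu, states))              # mixed state estimate
    P = sum(m * (C + np.outer(s - x, s - x))                # mixed covariance
            for m, s, C in zip(mu, states, covs))
    return mu, x, P
```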
CN202410025341.8A 2024-01-08 2024-01-08 Visual inertial integrated navigation algorithm based on dotted line characteristics Pending CN117870659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410025341.8A CN117870659A (en) 2024-01-08 2024-01-08 Visual inertial integrated navigation algorithm based on dotted line characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410025341.8A CN117870659A (en) 2024-01-08 2024-01-08 Visual inertial integrated navigation algorithm based on dotted line characteristics

Publications (1)

Publication Number Publication Date
CN117870659A true CN117870659A (en) 2024-04-12

Family

ID=90576694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410025341.8A Pending CN117870659A (en) 2024-01-08 2024-01-08 Visual inertial integrated navigation algorithm based on dotted line characteristics

Country Status (1)

Country Link
CN (1) CN117870659A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097191A (en) * 2024-04-22 2024-05-28 山东科技大学 Anti-shielding rigid body target quick matching and pose estimation method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination