CN113188557A - Visual inertial integrated navigation method fusing semantic features - Google Patents

Visual inertial integrated navigation method fusing semantic features

Info

Publication number
CN113188557A
CN113188557A
Authority
CN
China
Prior art keywords
semantic
plane
visual
coordinate system
pose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110467584.3A
Other languages
Chinese (zh)
Other versions
CN113188557B (en)
Inventor
黄郑
王红星
雍成优
朱洁
刘斌
吕品
陈玉权
何容
吴媚
赖际舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Jiangsu Fangtian Power Technology Co Ltd
Original Assignee
Nanjing University of Aeronautics and Astronautics
Jiangsu Fangtian Power Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics, Jiangsu Fangtian Power Technology Co Ltd
Priority to CN202110467584.3A
Publication of CN113188557A
Application granted
Publication of CN113188557B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3407 Route searching; Route guidance specially adapted for specific applications
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/10 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C21/12 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C21/16 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C21/165 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C22/00 Measuring distance traversed on the ground by vehicles, persons, animals or other moving solid bodies, e.g. using odometers, using pedometers
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S11/00 Systems for determining distance or velocity not using reflection or reradiation
    • G01S11/12 Systems for determining distance or velocity not using reflection or reradiation using electromagnetic waves other than radio waves
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Electromagnetism (AREA)
  • Navigation (AREA)

Abstract

The invention discloses a visual inertial integrated navigation method fusing semantic features. At time k, RGBD visual sensor data S(k), accelerometer data ã(k) and gyroscope data ω̃(k) are collected. The current camera pose T(k) is computed with a visual odometer from the visual sensor data S(k); semantic plane features are extracted from S(k) and matched between two adjacent image frames; the inertial sensor data are pre-integrated between the two adjacent image frames; the carrier navigation information is then solved by jointly optimizing the semantic plane observation residual, the visual-odometer relative-pose observation residual and the inertial pre-integration residual; finally, the carrier navigation information and the camera intrinsic parameters are output. The invention can effectively improve the positioning accuracy and robustness of the navigation system.

Description

Visual inertial integrated navigation method fusing semantic features
Technical Field
The invention belongs to the technical field of robot navigation, and particularly relates to a visual inertial integrated navigation method fusing semantic features.
Background
Thanks to the richness of its perception information, visual SLAM has become a major research topic in autonomous robot navigation. Conventional visual SLAM methods describe the environment and compute poses from low-level features such as points and lines; these features are described and matched through low-level intensity relationships, so the structural redundancy in the scene is not fully exploited. Moreover, a single vision sensor, with its limited perception dimension, can hardly provide robust positioning for an unmanned aerial vehicle in a complex indoor environment. Both issues affect the accuracy and reliability of the navigation system.
Disclosure of Invention
To address the defects of the prior art, the invention provides a visual inertial integrated navigation method fusing semantic features, so as to improve the autonomous positioning accuracy of an unmanned aerial vehicle.
To achieve this purpose, the technical solution of the invention is as follows:
a visual inertial integrated navigation method fusing semantic features comprises the following steps:
step 1, collecting RGBD visual sensor data S(k), accelerometer data ã(k) and gyroscope data ω̃(k) at time k;
step 2, solving for the current camera pose T(k) with a visual odometer from the visual sensor data S(k);
step 3, constructing a semantic plane feature map from the visual sensor data between two adjacent image frames, matching the semantic planes of the current frame with the semantic landmarks in the map, and obtaining the observation relation between the current key frame and the semantic landmarks;
step 4, performing inertial pre-integration between two adjacent image frames from the inertial sensor data, where the inertial sensor data comprise the accelerometer data and the gyroscope data;
step 5, taking the sum of the semantic plane observation residual, the visual-odometer relative-pose observation residual and the inertial pre-integration residual as a joint optimization function, and performing nonlinear optimization on this function to solve for the pose;
step 6, outputting the carrier navigation information and the camera intrinsic parameters, and returning to step 1.
Preferably, acquiring the current pose of the camera specifically comprises: first extracting ORB features from two adjacent frames, then computing the relative pose between the two key frames from the ORB feature matches between the frames using PnP, and obtaining the camera pose T(k) at time k by accumulating the relative poses.
Preferably, plane information is acquired from the visual sensor data between two adjacent image frames, and a semantic plane feature is constructed from the semantic category, centroid, normal direction and horizontal/vertical type of the plane:
· s_p = {p_x, p_y, p_z}
· s_n = {n_x, n_y, n_z, n_d}
· s_o ∈ {horizontal, vertical}
· s_c = the detected semantic object class corresponding to the plane
where s_p is the plane centroid, s_n is the plane normal parameter, s_o is the plane type label (horizontal/vertical), and s_c is the plane semantic class label, which depends on the semantic object class to which the plane corresponds;
a semantic plane feature map is constructed from the constructed semantic plane features;
the initial semantic landmarks are defined, and each detected semantic plane S_k of every frame is matched against the initial semantic landmarks by category, normal direction and centroid, yielding the observation relation between the current key frame and the semantic landmarks.
Preferably, the inertial sensor data obtained at time k comprise the accelerometer data ã(i) and gyroscope data ω̃(i) from time k-1 to time k, which are used to construct the inertial sensor measurement model:
ã = a + b_a + R^b_W · g^W + n_a
ω̃ = ω + b_ω + n_ω
where n_a and n_ω are the accelerometer and gyroscope white noise respectively; b_a and b_ω are the accelerometer and gyroscope random walks, whose derivatives are white noise; a is the ideal accelerometer measurement and ω the ideal gyroscope measurement; g^W is the gravity in the navigation frame; and R^b_W is the rotation matrix from the navigation coordinate system to the body coordinate system at the sampling instant;
a plurality of inertial sensor samples lie between two adjacent image frames, and all inertial sensor data between the two adjacent image frames are pre-integrated iteratively:
α_{i+1} = α_i + β_i·Δt + ½·R(γ_i)·(ã_i − b_a)·Δt²
β_{i+1} = β_i + R(γ_i)·(ã_i − b_a)·Δt
γ_{i+1} = γ_i ⊗ [1, ½·(ω̃_i − b_ω)·Δt]
where α is the position pre-integration, β the velocity pre-integration, the rotation is represented by the quaternion γ, and γ is the rotation pre-integration; at the start, α and β are 0 and γ is the unit quaternion; R(γ) denotes the conversion of a quaternion into a rotation matrix and ⊗ denotes quaternion multiplication;
the frequency of the visual data is made consistent with the inertial pre-integration frequency, yielding the inertial pre-integration terms α, β and γ between two adjacent image frames.
Preferably, the optimization function is:
e(X) = Σ_{(i,k)} ρ( ||e^S_{ik}||²_{Σ_S} ) + Σ_{(i,j)} ρ( ||e^V_{ij}||²_{Σ_V} ) + Σ_{(i,j)} ||e^B_{ij}||²_{Σ_B},   with ||e||²_Σ = eᵀ Σ⁻¹ e
where the Semantic part (first sum) represents the semantic landmark observation residual, e^S_{ik} being the observation error of a semantic landmark observed in a given camera frame; the VO part (second sum) represents the observation residual of the relative pose in the visual odometer; the third term is the IMU pre-integration estimation residual; ρ(·) is a robust kernel function; Σ⁻¹ is the information matrix associated with each error term, i.e. the inverse of its covariance, representing the a-priori confidence in each constraint; i and j denote the i-th and j-th image frames, and k the k-th semantic landmark.
Preferably, the semantic landmark observation error e compares the position observation P̃^{Ci}_{Lk} of the semantic landmark in the camera coordinate system with the estimate obtained by transforming the landmark position into the camera frame through the camera pose, where P̃^{Ci}_{Lk} is the position observation of the semantic landmark in the camera coordinate system, and the camera pose T^W_{Ci} at the i-th frame and the position P^W_{Lk} of landmark L_k in the world frame are the variables to be optimized; the Jacobian matrices of the error e with respect to T^W_{Ci} and P^W_{Lk} are then computed;
the pose observation residual e compares the relative-pose observation T̃^{Ci}_{Cj} of the visual odometer between the i-th and j-th frames with the relative pose predicted from the poses T^W_{Ci} and T^W_{Cj} to be optimized; the Jacobian matrices of this error with respect to the poses T^W_{Ci} and T^W_{Cj} are computed;
the inertial pre-integration residual is computed as the difference between the predicted values and the pre-integration values between two adjacent image frames.
Preferably, the Jacobian matrices of e in the semantic landmark observation error with respect to T^W_{Ci} and P^W_{Lk} are derived analytically, where ξ is the Lie-algebra representation of the pose T^W_{Ci}, [X_i′ Y_i′ Z_i′]ᵀ are the coordinates of the semantic plane in the camera frame at time i, f_x and f_y are the focal lengths from the camera intrinsic parameters, and R^{Ci}_W is the rotation component of T^{Ci}_W.
Preferably, in the pose observation residual, the Jacobian matrices of e with respect to the poses T^W_{Ci} and T^W_{Cj} are derived using the Lie-algebra right perturbation model and its adjoint property.
Preferably, the inertial pre-integration residual stacks the position, velocity, rotation, accelerometer-random-walk and gyroscope-random-walk errors between the predicted values and the pre-integrated values, where R^{bj}_W is the rotation matrix from the navigation coordinate system to the body coordinate system at time j; p^W_{bi} and p^W_{bj} are the positions of the body coordinate system in the navigation coordinate system at times i and j respectively; v^W_{bi} and v^W_{bj} are the corresponding velocities; Δt_i is the time interval between two adjacent image frames; q^W_{bi} and q^W_{bj} are the quaternions of the rotation of the body coordinate system in the navigation coordinate system at times i and j respectively; b_{a,i} and b_{a,j} are the accelerometer random walks in the body coordinate system at times i and j respectively; b_{ω,i} and b_{ω,j} are the gyroscope random walks in the body coordinate system at times i and j respectively; and [γ]_{xyz} denotes taking the x, y, z components of the quaternion γ.
The invention provides the following technical effects. When the camera maneuvers strongly, motion estimation that relies on visual features alone yields low navigation and positioning accuracy. The invention fuses semantic features and inertial information into the optimization objective function of the visual odometer, adding high-dimensional feature observations and additional sensor observation constraints to the visual odometer; the short-term relative motion is estimated and the pose is optimized, which improves the pose estimation accuracy of the visual odometer under strong maneuvers and thereby the navigation positioning accuracy and robustness of the visual odometer.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the invention provides a visual inertial integrated navigation method fusing semantic features, comprising the following steps:
Step 1: collecting RGBD visual sensor data S(k), accelerometer data ã(k) and gyroscope data ω̃(k) at time k.
Step 2: solving for the current camera pose T(k) with a visual odometer from the visual sensor data S(k):
first, ORB features are extracted from two adjacent frames; then the relative pose between the two key frames is computed from the ORB feature matches between the frames using Perspective-n-Point (PnP); the camera pose T(k) at time k is obtained by accumulating the relative poses.
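As an illustration of this step, the following minimal Python/OpenCV sketch (not part of the patent text; estimate_relative_pose is a placeholder name, and the camera intrinsics K and a depth map aligned to the previous frame are assumed to be available) shows one way to obtain the frame-to-frame relative pose from ORB matches and PnP:

```python
import cv2
import numpy as np

def estimate_relative_pose(img_prev, depth_prev, img_curr, K):
    """Estimate the relative pose between two RGBD frames via ORB matching + PnP.

    img_prev/img_curr: grayscale images; depth_prev: depth map aligned to img_prev (metres);
    K: 3x3 camera intrinsic matrix. Returns (R, t) mapping points expressed in the previous
    camera frame into the current camera frame.
    """
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)

    # Hamming-distance brute-force matching with cross-check for binary ORB descriptors
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    obj_pts, img_pts = [], []
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    for m in matches:
        u, v = kp1[m.queryIdx].pt
        z = depth_prev[int(v), int(u)]
        if z <= 0:          # skip pixels with invalid depth
            continue
        # Back-project the previous-frame keypoint to a 3D point in the previous camera frame
        obj_pts.append([(u - cx) * z / fx, (v - cy) * z / fy, z])
        img_pts.append(kp2[m.trainIdx].pt)

    obj_pts = np.asarray(obj_pts, dtype=np.float64)
    img_pts = np.asarray(img_pts, dtype=np.float64)

    # PnP with RANSAC rejects remaining outlier matches
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```

Accumulating the (R, t) returned for each frame pair gives the pose T(k) used in the following steps.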
Step 3: extracting and matching semantic plane features between two adjacent image frames from the visual sensor data S(k):
RGBD visual sensor data S(k) and S(k-1) are acquired at times k and k-1, and the corresponding semantic plane features are extracted and matched as follows.
(a) Semantic information extraction
The set semantic objects are detected with the YOLOv3 object detection algorithm, yielding the two-dimensional bounding box bbox of each semantic object in the image frame.
(b) Semantic plane extraction
Within the bbox region, the depth image in the visual sensor data S(k) is converted into a structured point cloud. Following a connected-component-based structured point cloud plane segmentation algorithm, point normal vectors are extracted from neighbourhood covariances, and neighbouring points are judged to belong to the same connected component according to their normal vectors and the distances of their tangent planes to the origin.
Each connected component whose number of points exceeds a set threshold N_min is fitted with a plane by least squares to obtain its plane equation. This yields the parameters of the plane on which the semantic object surface lies (centroid {x, y, z} and normal n_p = {n_x, n_y, n_z}) together with the object category to which the semantic plane belongs, completing the semantic plane extraction.
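A compact way to realise the least-squares plane fit of a connected component is via the eigen-decomposition of the point covariance; the sketch below is an illustration under that choice, not the patent's implementation, and fit_plane and n_min are placeholder names:

```python
import numpy as np

def fit_plane(points, n_min=50):
    """Least-squares plane fit for one connected component of a structured point cloud.

    points: (N, 3) array of 3D points. Returns (centroid, normal, d) with the plane
    written as normal . x + d = 0, or None if the component is too small.
    """
    if points.shape[0] < n_min:          # reject components below the point-count threshold N_min
        return None
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)  # 3x3 covariance of the demeaned points
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]               # eigenvector of the smallest eigenvalue = plane normal
    normal /= np.linalg.norm(normal)
    d = -float(normal @ centroid)        # plane offset so that normal . x + d = 0
    return centroid, normal, d
```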
(c) Semantic plane feature construction from the extracted plane information
First, the plane type (horizontal or vertical) is determined from the plane normal n_p and the a-priori ground plane normal n_g. The difference between the point cloud normal and the prior ground normal is computed as d_hor = ||n_p − n_g||; if d_hor is smaller than a set horizontal-normal difference threshold t_hor, the plane with normal n_p is labelled a horizontal plane. The deviation of the plane normal from a vertical plane is computed with the dot product d_vert = n_p · n_g; if d_vert is smaller than a set threshold t_vert, the plane with normal n_p is labelled a vertical plane.
Semantic plane features are then constructed from the semantic category, centroid, normal direction and horizontal/vertical type of the plane:
· s_p = {p_x, p_y, p_z}
· s_n = {n_x, n_y, n_z, n_d}
· s_o ∈ {horizontal, vertical}
· s_c = the detected semantic object class corresponding to the plane
where s_p is the plane centroid, s_n is the plane normal parameter, s_o is the plane type label (horizontal/vertical), and s_c is the plane semantic class label, which depends on the semantic object class to which the plane corresponds, as illustrated in the sketch that follows.
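For illustration only, the four attributes above can be grouped into a small record; in the Python sketch below, SemanticPlane and classify_plane_type are placeholder names, the thresholds t_hor and t_vert stand in for the patent's tuned values, and the absolute value in the vertical test is a small robustness addition (it tolerates flipped normals) not stated in the text:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SemanticPlane:
    s_p: np.ndarray   # plane centroid {p_x, p_y, p_z}
    s_n: np.ndarray   # plane normal parameters {n_x, n_y, n_z, n_d}
    s_o: str          # plane type label: "horizontal" or "vertical"
    s_c: str          # semantic class of the detected object the plane belongs to

def classify_plane_type(n_p, n_g, t_hor=0.2, t_vert=0.2):
    """Label a plane as horizontal/vertical from its normal n_p and the prior ground normal n_g."""
    d_hor = np.linalg.norm(n_p - n_g)        # small when n_p is parallel to the ground normal
    if d_hor < t_hor:
        return "horizontal"
    d_vert = abs(float(np.dot(n_p, n_g)))    # small when n_p is perpendicular to the ground normal
    if d_vert < t_vert:
        return "vertical"
    return "other"
```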
(d) Constructing a semantic plane feature map from the semantic plane features, and matching the semantic planes of the current frame against the semantic landmarks in the map.
From the semantic planes obtained in step (b), each plane is written as
S_i = { s^i_p, s^i_n, s^i_o, s^i_c }
where S_i denotes a semantic plane, s^i_p its centroid, s^i_n its normal parameters, s^i_o its plane type label (horizontal/vertical), and s^i_c its plane semantic class label.
The first received semantic plane is mapped directly to the first semantic landmark, denoted L_i, whose centroid p^W_{Li} and normal n^W_{Li} are expressed in the world coordinate system. They are obtained from the plane centroid p^C and normal n^C in the camera frame by the coordinate transformation
p^W_{Li} = R^W_C · p^C + x_r
n^W_{Li} = R^W_C · n^C
where x_r is the current camera position in the world coordinate system and R^W_C is the rotation matrix from the camera coordinate system to the world coordinate system; a minimal numerical sketch of this mapping follows.
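The sketch below is illustrative only; plane_to_world is a placeholder name, and R_wc and x_r are assumed to come from the current visual-odometry pose T(k):

```python
import numpy as np

def plane_to_world(centroid_c, normal_c, R_wc, x_r):
    """Map a semantic plane's centroid and normal from the camera frame to the world frame.

    centroid_c, normal_c: 3-vectors in the camera frame;
    R_wc: 3x3 rotation from camera to world; x_r: camera position in the world frame.
    """
    centroid_w = R_wc @ centroid_c + x_r   # points transform with rotation + translation
    normal_w = R_wc @ normal_c             # directions transform with rotation only
    return centroid_w, normal_w / np.linalg.norm(normal_w)
```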
After the initial semantic landmarks have been set, each detected semantic plane S_k of every frame is associated with the semantic landmarks in three steps:
First, the semantic planes extracted from the frame are matched against the semantic landmarks by semantic category and plane type. For each successful match, the point count and area of the semantic plane are computed, and semantic planes whose point count is below a minimum threshold t_p or whose plane area is below a minimum threshold t_a are discarded.
Second, the normal of the semantic plane is transformed from the camera coordinate system to the world coordinate system, giving n^W, and compared with the landmark normal n^W_L. If the deviation between n^W and n^W_L is below a set threshold t_n, the semantic plane is considered consistent with the landmark normal and the subsequent coordinate association is carried out; otherwise the match fails.
Third, the centroid of the semantic plane is transformed from the camera coordinate system to the world coordinate system, and the Mahalanobis distance to the centroid of the matched landmark is computed. If this distance exceeds a set threshold, the detected semantic object is mapped to a new semantic landmark l_j; otherwise the semantic object is associated with the current semantic landmark l_i.
Through these steps, the observation relation between the current key frame and the semantic landmarks is obtained, as sketched below.
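The following Python sketch condenses the three association tests (category/type gating, normal consistency, Mahalanobis gating on the centroid). It is illustrative only: associate_plane and the per-landmark covariance 'cov' are assumptions, and the thresholds t_p, t_a, t_n, t_m stand in for the patent's tuned values.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance between a point x and a landmark centroid with covariance cov."""
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def associate_plane(plane, landmarks, R_wc, x_r,
                    t_p=200, t_a=0.05, t_n=0.1, t_m=3.0):
    """Return ('reject', None), ('match', idx) or ('new', None) for a detected semantic plane.

    plane: dict with 'points', 'area', 'centroid', 'normal', 's_o', 's_c' in the camera frame;
    landmarks: list of dicts with 'centroid', 'normal', 'cov', 's_o', 's_c' in the world frame.
    """
    # Gate 1: discard sparse or small planes
    if len(plane['points']) < t_p or plane['area'] < t_a:
        return ('reject', None)
    n_w = R_wc @ plane['normal']                     # plane normal in the world frame
    c_w = R_wc @ plane['centroid'] + x_r             # plane centroid in the world frame
    for idx, lm in enumerate(landmarks):
        # Gate 2: same semantic class and same horizontal/vertical type
        if lm['s_c'] != plane['s_c'] or lm['s_o'] != plane['s_o']:
            continue
        # Gate 3: normal consistency in the world frame
        if np.linalg.norm(n_w - lm['normal']) > t_n:
            continue
        # Gate 4: Mahalanobis distance between centroids
        if mahalanobis(c_w, lm['centroid'], lm['cov']) <= t_m:
            return ('match', idx)                    # observation of an existing landmark
    return ('new', None)                             # caller creates a new landmark l_j
```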
Step 4: pre-integration between two adjacent image frames from the inertial sensor data:
The inertial sensor data obtained at time k comprise the accelerometer data ã(i) and gyroscope data ω̃(i) from time k-1 to time k, with i = 0, 1, 2, …, (t(k) − t(k−1))/Δt, where t(k) is the sampling instant corresponding to time k, t(k−1) the sampling instant corresponding to time k−1, and Δt the sampling period of the inertial sensor. The inertial sensor measurement model is:
ã = a + b_a + R^b_W · g^W + n_a
ω̃ = ω + b_ω + n_ω
where n_a and n_ω are the accelerometer and gyroscope white noise respectively; b_a and b_ω are the accelerometer and gyroscope random walks, whose derivatives are white noise; a is the ideal accelerometer measurement and ω the ideal gyroscope measurement; g^W is the gravity in the navigation frame; and R^b_W is the rotation matrix from the navigation coordinate system to the body coordinate system at the sampling instant.
The pre-integration between two adjacent inertial samples is:
α_{i+1} = α_i + β_i·Δt + ½·R(γ_i)·(ã_i − b_a)·Δt²
β_{i+1} = β_i + R(γ_i)·(ã_i − b_a)·Δt
γ_{i+1} = γ_i ⊗ [1, ½·(ω̃_i − b_ω)·Δt]
where α is the position pre-integration, β the velocity pre-integration, the rotation is represented by the quaternion γ, and γ is the rotation pre-integration; at the start, α and β are 0 and γ is the unit quaternion; R(γ) denotes the conversion of a quaternion into a rotation matrix and ⊗ denotes quaternion multiplication. Several inertial samples lie between two adjacent image frames, and all of them are pre-integrated iteratively with the formulas above, which brings the visual data frequency into agreement with the inertial pre-integration frequency and yields the pre-integration terms α, β and γ between the two adjacent image frames, as in the sketch below.
and 5: optimally solving carrier navigation information by combining a semantic plane observation residual error, a visual odometer relative pose observation residual error and an inertia pre-integration residual error:
a) establishing an optimized variable X:
Figure BDA0003043820950000114
wherein ,
Figure BDA0003043820950000115
n is the sequence number of the last frame,
Figure BDA0003043820950000116
Figure BDA0003043820950000117
and
Figure BDA0003043820950000118
respectively representing the carrier position and speed of the k-th frameRandom walk of degrees, quaternions, accelerometers, and gyroscopes;
Figure BDA0003043820950000119
the coordinates of the semantic plane features are used, and m is the serial number of the last semantic plane feature;
(b) establishing an optimization function e:
Figure BDA00030438209500001110
wherein the Semantic part represents the residual of the Semantic roadmap observation,
Figure BDA00030438209500001111
representing the semantic landmark position error observed by a certain frame of the camera. The VO part represents the observed residual error of the relative pose in the visual odometer. The third term is the IMU pre-integration estimate residual. P (-) represents a robust kernel function, Σ-1An information matrix representing the error amount is an inverse of the covariance and represents the pre-estimation of each constraint accuracy. Wherein i and j refer to the ith and jth image frames respectively, and k refers to the kth semantic landmark.
For the semantic landmark observation error: the error e compares the position observation P̃^{Ci}_{Lk} of the semantic landmark in the camera coordinate system with the prediction obtained by mapping the landmark position into the i-th camera frame, where P̃^{Ci}_{Lk} is the position observation of the semantic landmark in the camera coordinate system, and the camera pose T^W_{Ci} at the i-th frame and the position P^W_{Lk} of landmark L_k in the world frame are the variables to be optimized. The Jacobian matrices of e with respect to T^W_{Ci} and P^W_{Lk} are derived analytically in terms of ξ, the Lie-algebra representation of the pose T^W_{Ci}; [X_i′ Y_i′ Z_i′]ᵀ, the coordinates of the semantic plane in the camera frame at time i; the focal lengths f_x and f_y from the camera intrinsic parameters; and R^{Ci}_W, the rotation component of T^{Ci}_W.
For the VO pose observation residual: the error e compares the relative-pose observation T̃^{Ci}_{Cj} of the visual odometer between the i-th and j-th frames with the relative pose predicted from the poses T^W_{Ci} and T^W_{Cj}, which are the variables to be optimized. Using the Lie-algebra right perturbation model and its adjoint property, the Jacobian matrices of e with respect to the poses T^W_{Ci} and T^W_{Cj} are obtained.
for the inertial pre-integration residual, it is obtained from the difference between the predicted value and the pre-integration value between two adjacent image frames:
Figure BDA0003043820950000131
in the above formula, the first and second carbon atoms are,
Figure BDA0003043820950000132
a rotation matrix from a j moment navigation coordinate system to a body coordinate system;
Figure BDA0003043820950000133
Figure BDA0003043820950000134
respectively being the positions of the organism coordinate system at the time i and the time j under the navigation coordinate system;
Figure BDA0003043820950000135
the speed of the organism coordinate system at the time i and the speed of the organism coordinate system at the time j are respectively under the navigation coordinate system; Δ tiThe time interval between two adjacent image frames;
Figure BDA0003043820950000136
quaternion of the rotation of the body coordinate system under the navigation coordinate system at the time i and the time j respectively;
Figure BDA0003043820950000137
Respectively moving the accelerometers in the machine body coordinate system at the time i and the time j randomly;
Figure BDA0003043820950000138
the gyroscope respectively moves randomly at i moment and j moment under the coordinate system of the body, [ gamma ]]xyzRepresenting the x, y, z components of a quaternion gamma;
(c) and (4) carrying out iterative solution on the optimization function by using a Levenberg-Marquardt algorithm, and stopping iteration when error convergence or a set maximum iteration number is reached to obtain carrier navigation information.
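This sketch is purely illustrative: residual_and_jacobian is a placeholder for the concatenated semantic/VO/IMU residuals and their Jacobians, the state is treated as a flat vector (in practice the pose updates live on a manifold), and the damping schedule and thresholds are assumptions:

```python
import numpy as np

def levenberg_marquardt(x0, residual_and_jacobian, max_iter=50, tol=1e-6, lam=1e-3):
    """Minimise 0.5*||r(x)||^2 by Levenberg-Marquardt over the stacked residual r.

    residual_and_jacobian(x) must return (r, J) with r an (M,) vector and J an (M, N) matrix.
    """
    x = x0.copy()
    r, J = residual_and_jacobian(x)
    cost = 0.5 * float(r @ r)
    for _ in range(max_iter):
        H = J.T @ J                                   # Gauss-Newton approximation of the Hessian
        g = J.T @ r
        dx = np.linalg.solve(H + lam * np.diag(np.diag(H)), -g)
        x_new = x + dx
        r_new, J_new = residual_and_jacobian(x_new)
        cost_new = 0.5 * float(r_new @ r_new)
        if cost_new < cost:                           # accept the step, relax the damping
            x, r, J, cost = x_new, r_new, J_new, cost_new
            lam *= 0.5
            if np.linalg.norm(dx) < tol:              # stop when the update has converged
                break
        else:                                         # reject the step, increase the damping
            lam *= 5.0
    return x
```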
Step 6: outputting the carrier navigation information and the camera intrinsic parameters, and returning to Step 1.
When the camera maneuvers strongly, motion estimation that relies on visual features alone yields low navigation and positioning accuracy. The invention fuses semantic features and inertial information into the optimization objective function of the visual odometer, adding high-dimensional feature observations and additional sensor observation constraints to the visual odometer; the short-term relative motion is estimated and the pose is optimized, which improves the pose estimation accuracy of the visual odometer under strong maneuvers and thereby the navigation positioning accuracy and robustness of the visual odometer.
The above-described embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made to the technical solutions of the invention by those skilled in the art without departing from the spirit of the invention shall fall within the protection scope defined by the claims.

Claims (9)

1. A visual inertial integrated navigation method fusing semantic features, characterized by comprising the following steps:
step 1, collecting RGBD visual sensor data S(k), accelerometer data ã(k) and gyroscope data ω̃(k) at time k;
step 2, solving for the current camera pose T(k) with a visual odometer from the visual sensor data S(k);
step 3, constructing a semantic plane feature map from the visual sensor data between two adjacent image frames, matching the semantic planes of the current frame with the semantic landmarks in the map, and obtaining the observation relation between the current key frame and the semantic landmarks;
step 4, performing inertial pre-integration between two adjacent image frames from the inertial sensor data, wherein the inertial sensor data comprise the accelerometer data and the gyroscope data;
step 5, taking the sum of the semantic plane observation residual, the visual-odometer relative-pose observation residual and the inertial pre-integration residual as a joint optimization function, and performing nonlinear optimization on this function to solve for the pose; and
step 6, outputting the carrier navigation information and the camera intrinsic parameters, and returning to step 1.
2. The visual inertial integrated navigation method fusing semantic features according to claim 1, wherein acquiring the current pose of the camera specifically comprises: first extracting ORB features from two adjacent frames, then computing the relative pose between the two key frames from the ORB feature matches between the frames using PnP, and obtaining the camera pose T(k) at time k by accumulating the relative poses.
3. The visual inertial integrated navigation method fusing semantic features according to claim 2, wherein plane information is acquired from the visual sensor data between two adjacent image frames, and a semantic plane feature is constructed from the semantic category, centroid, normal direction and horizontal/vertical type of the plane:
· s_p = {p_x, p_y, p_z}
· s_n = {n_x, n_y, n_z, n_d}
· s_o ∈ {horizontal, vertical}
· s_c = the detected semantic object class corresponding to the plane
wherein s_p is the plane centroid, s_n is the plane normal parameter, s_o is the plane type label (horizontal/vertical), and s_c is the plane semantic class label, which depends on the semantic object class corresponding to the plane;
a semantic plane feature map is constructed from the constructed semantic plane features; and
initial semantic landmarks are defined, and each detected semantic plane S_k of every frame is matched against the initial semantic landmarks by category, normal direction and centroid, yielding the observation relation between the current key frame and the semantic landmarks.
4. The visual inertial integrated navigation method fusing semantic features according to claim 1, wherein the inertial sensor data obtained at time k comprise the accelerometer data ã(i) and gyroscope data ω̃(i) from time k-1 to time k, which are used to construct the inertial sensor measurement model:
ã = a + b_a + R^b_W · g^W + n_a
ω̃ = ω + b_ω + n_ω
wherein n_a and n_ω are the accelerometer and gyroscope white noise respectively; b_a and b_ω are the accelerometer and gyroscope random walks, whose derivatives are white noise; a is the ideal accelerometer measurement and ω the ideal gyroscope measurement; g^W is the gravity in the navigation frame; R^b_W is the rotation matrix from the navigation coordinate system to the body coordinate system at the sampling instant;
a plurality of inertial sensor samples lie between two adjacent image frames, and all inertial sensor data between the two adjacent image frames are pre-integrated iteratively:
α_{i+1} = α_i + β_i·Δt + ½·R(γ_i)·(ã_i − b_a)·Δt²
β_{i+1} = β_i + R(γ_i)·(ã_i − b_a)·Δt
γ_{i+1} = γ_i ⊗ [1, ½·(ω̃_i − b_ω)·Δt]
wherein α is the position pre-integration, β the velocity pre-integration, the rotation is represented by the quaternion γ, and γ is the rotation pre-integration; at the start, α and β are 0 and γ is the unit quaternion; R(γ) denotes the conversion of a quaternion into a rotation matrix and ⊗ denotes quaternion multiplication; and
the frequency of the visual data is made consistent with the inertial pre-integration frequency, yielding the inertial pre-integration terms α, β and γ between two adjacent image frames.
5. The visual inertial integrated navigation method fusing semantic features according to claim 1, wherein the optimization function is:
e(X) = Σ_{(i,k)} ρ( ||e^S_{ik}||²_{Σ_S} ) + Σ_{(i,j)} ρ( ||e^V_{ij}||²_{Σ_V} ) + Σ_{(i,j)} ||e^B_{ij}||²_{Σ_B},   with ||e||²_Σ = eᵀ Σ⁻¹ e
wherein the Semantic part (first sum) represents the semantic landmark observation residual, e^S_{ik} being the observation error of a semantic landmark observed in a given camera frame; the VO part (second sum) represents the observation residual of the relative pose in the visual odometer; the third term is the IMU pre-integration estimation residual; ρ(·) is a robust kernel function; Σ⁻¹ is the information matrix of each error term, i.e. the inverse of its covariance, representing the a-priori confidence in each constraint; and i and j denote the i-th and j-th image frames and k the k-th semantic landmark.
6. The visual inertial integrated navigation method fusing semantic features according to claim 5, wherein
the semantic landmark observation error e compares the position observation P̃^{Ci}_{Lk} of the semantic landmark in the camera coordinate system with the estimate obtained by transforming the landmark position into the camera frame through the camera pose, wherein P̃^{Ci}_{Lk} is the position observation of the semantic landmark in the camera coordinate system, and the camera pose T^W_{Ci} at the i-th frame and the position P^W_{Lk} of landmark L_k in the world frame are the variables to be optimized; the Jacobian matrices of the error e with respect to T^W_{Ci} and P^W_{Lk} are computed;
the pose observation residual e compares the relative-pose observation T̃^{Ci}_{Cj} of the visual odometer between the i-th and j-th frames with the relative pose predicted from the poses T^W_{Ci} and T^W_{Cj} to be optimized; the Jacobian matrices of this error with respect to the poses are computed; and
the inertial pre-integration residual is computed as the difference between the predicted values and the pre-integration values between two adjacent image frames.
7. The visual inertial integrated navigation method fusing semantic features according to claim 6, wherein the Jacobian matrices of e in the semantic landmark observation error with respect to T^W_{Ci} and P^W_{Lk} are derived analytically, wherein ξ is the Lie-algebra representation of the pose T^W_{Ci}, [X_i′ Y_i′ Z_i′]ᵀ are the coordinates of the semantic plane in the camera frame at time i, f_x and f_y are the focal lengths from the camera intrinsic parameters, and R^{Ci}_W is the rotation component of T^{Ci}_W.
8. The visual inertial integrated navigation method fusing semantic features according to claim 6, wherein the Jacobian matrices of e in the pose observation residual with respect to the poses T^W_{Ci} and T^W_{Cj} are obtained using the Lie-algebra right perturbation model and its adjoint property.
9. The visual inertial integrated navigation method fusing semantic features according to claim 6, wherein the inertial pre-integration residual stacks the position, velocity, rotation, accelerometer-random-walk and gyroscope-random-walk errors between the predicted values and the pre-integrated values, wherein R^{bj}_W is the rotation matrix from the navigation coordinate system to the body coordinate system at time j; p^W_{bi} and p^W_{bj} are the positions of the body coordinate system in the navigation coordinate system at times i and j respectively; v^W_{bi} and v^W_{bj} are the velocities of the body coordinate system in the navigation coordinate system at times i and j respectively; Δt_i is the time interval between two adjacent image frames; q^W_{bi} and q^W_{bj} are the quaternions of the rotation of the body coordinate system in the navigation coordinate system at times i and j respectively; b_{a,i} and b_{a,j} are the accelerometer random walks in the body coordinate system at times i and j respectively; b_{ω,i} and b_{ω,j} are the gyroscope random walks in the body coordinate system at times i and j respectively; and [γ]_{xyz} denotes taking the x, y, z components of the quaternion γ.
CN202110467584.3A 2021-04-28 2021-04-28 Visual inertial integrated navigation method integrating semantic features Active CN113188557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110467584.3A CN113188557B (en) 2021-04-28 2021-04-28 Visual inertial integrated navigation method integrating semantic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110467584.3A CN113188557B (en) 2021-04-28 2021-04-28 Visual inertial integrated navigation method integrating semantic features

Publications (2)

Publication Number Publication Date
CN113188557A true CN113188557A (en) 2021-07-30
CN113188557B CN113188557B (en) 2023-10-20

Family

ID=76979977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467584.3A Active CN113188557B (en) 2021-04-28 2021-04-28 Visual inertial integrated navigation method integrating semantic features

Country Status (1)

Country Link
CN (1) CN113188557B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113465598A (en) * 2021-08-04 2021-10-01 北京云恒科技研究院有限公司 Inertia combination navigation system suitable for unmanned aerial vehicle
CN114152937A (en) * 2022-02-09 2022-03-08 西南科技大学 External parameter calibration method for rotary laser radar
CN115493612A (en) * 2022-10-12 2022-12-20 中国第一汽车股份有限公司 Vehicle positioning method and device based on visual SLAM

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366508B1 (en) * 2016-08-29 2019-07-30 Perceptin Shenzhen Limited Visual-inertial positional awareness for autonomous and non-autonomous device
CN109945858A (en) * 2019-03-20 2019-06-28 浙江零跑科技有限公司 It parks the multi-sensor fusion localization method of Driving Scene for low speed
CN110794828A (en) * 2019-10-08 2020-02-14 福瑞泰克智能***有限公司 Road sign positioning method fusing semantic information
CN111156997A (en) * 2020-03-02 2020-05-15 南京航空航天大学 Vision/inertia combined navigation method based on camera internal parameter online calibration
CN111325842A (en) * 2020-03-04 2020-06-23 Oppo广东移动通信有限公司 Map construction method, repositioning method and device, storage medium and electronic equipment
CN111693047A (en) * 2020-05-08 2020-09-22 中国航空工业集团公司西安航空计算技术研究所 Visual navigation method for micro unmanned aerial vehicle in high-dynamic scene
CN112348921A (en) * 2020-11-05 2021-02-09 上海汽车集团股份有限公司 Mapping method and system based on visual semantic point cloud
CN112484725A (en) * 2020-11-23 2021-03-12 吉林大学 Intelligent automobile high-precision positioning and space-time situation safety method based on multi-sensor fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIE JIN: "Localization Based on Semantic Map and Visual Inertial Odometry", 2018 24th International Conference on Pattern Recognition (ICPR)
ZHOU WUGEN: "Research on Dense Visual Simultaneous Localization and Mapping in Dynamic Environments", China Doctoral Dissertations Full-text Database (Information Science and Technology)
XIE SHICHAO: "Research on Point Cloud Map Construction and Updating for Autonomous Driving Based on Multi-Sensor Fusion", China Master's Theses Full-text Database (Engineering Science and Technology II)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113465598A (en) * 2021-08-04 2021-10-01 北京云恒科技研究院有限公司 Inertia combination navigation system suitable for unmanned aerial vehicle
CN113465598B (en) * 2021-08-04 2024-02-09 北京云恒科技研究院有限公司 Inertial integrated navigation system suitable for unmanned aerial vehicle
CN114152937A (en) * 2022-02-09 2022-03-08 西南科技大学 External parameter calibration method for rotary laser radar
CN115493612A (en) * 2022-10-12 2022-12-20 中国第一汽车股份有限公司 Vehicle positioning method and device based on visual SLAM
WO2024077935A1 (en) * 2022-10-12 2024-04-18 中国第一汽车股份有限公司 Visual-slam-based vehicle positioning method and apparatus

Also Published As

Publication number Publication date
CN113188557B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN109307508B (en) Panoramic inertial navigation SLAM method based on multiple key frames
CN111595333B (en) Modularized unmanned vehicle positioning method and system based on visual inertia laser data fusion
CN112734852B (en) Robot mapping method and device and computing equipment
CN113781582B (en) Synchronous positioning and map creation method based on laser radar and inertial navigation combined calibration
CN113188557A (en) Visual inertial integrated navigation method fusing semantic features
WO2020038285A1 (en) Lane line positioning method and device, storage medium and electronic device
CN110044354A (en) A kind of binocular vision indoor positioning and build drawing method and device
Panahandeh et al. Vision-aided inertial navigation based on ground plane feature detection
CN110095116A (en) A kind of localization method of vision positioning and inertial navigation combination based on LIFT
CN107941217B (en) Robot positioning method, electronic equipment, storage medium and device
CN112634451A (en) Outdoor large-scene three-dimensional mapping method integrating multiple sensors
CN107909614B (en) Positioning method of inspection robot in GPS failure environment
CN105953796A (en) Stable motion tracking method and stable motion tracking device based on integration of simple camera and IMU (inertial measurement unit) of smart cellphone
CN111337943B (en) Mobile robot positioning method based on visual guidance laser repositioning
CN111156997B (en) Vision/inertia combined navigation method based on camera internal parameter online calibration
CN114526745A (en) Drawing establishing method and system for tightly-coupled laser radar and inertial odometer
CN111060099B (en) Real-time positioning method for unmanned automobile
CN113220818B (en) Automatic mapping and high-precision positioning method for parking lot
CN114529576A (en) RGBD and IMU hybrid tracking registration method based on sliding window optimization
CN110675455A (en) Self-calibration method and system for car body all-around camera based on natural scene
CN116518984B (en) Vehicle road co-location system and method for underground coal mine auxiliary transportation robot
CN115371665A (en) Mobile robot positioning method based on depth camera and inertia fusion
CN112179373A (en) Measuring method of visual odometer and visual odometer
CN115574816A (en) Bionic vision multi-source information intelligent perception unmanned platform
CN112731503A (en) Pose estimation method and system based on front-end tight coupling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant