CN116299525A - Dynamic environment RGB-D vision SLAM method based on point cloud region correlation - Google Patents

Dynamic environment RGB-D vision SLAM method based on point cloud region correlation

Info

Publication number
CN116299525A
CN116299525A (application CN202310176750.3A)
Authority
CN
China
Prior art keywords
point cloud
point
sub
dynamic
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310176750.3A
Other languages
Chinese (zh)
Inventor
禹鑫燚
郑万财
欧林林
周利波
魏岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310176750.3A priority Critical patent/CN116299525A/en
Publication of CN116299525A publication Critical patent/CN116299525A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/86Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/88Lidar systems specially adapted for specific applications
    • G01S17/89Lidar systems specially adapted for specific applications for mapping or imaging
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/88Lidar systems specially adapted for specific applications
    • G01S17/93Lidar systems specially adapted for specific applications for anti-collision purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Theoretical Computer Science (AREA)
  • Electromagnetism (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A dynamic environment RGB-D visual SLAM method based on point cloud region correlation can serve as front-end preprocessing for an ORB-SLAM2/3 system, filtering out dynamic regions and thereby reducing erroneous data associations. The system takes the RGB image and depth map of the current frame as input and generates a point cloud from them; once a frame has been processed it is temporarily stored in the system as a reference frame and is not processed again. A clustering algorithm segments the depth map into sub-point clouds, and the sub-pose change between each sub-point cloud and the reference frame is then computed in turn. The method also introduces the idea of dynamic/static correlation between sub-point clouds; combining this idea with the sub-poses makes it possible to identify which sub-point clouds are dynamic without relying on epipolar geometric constraints and without knowing the approximate camera pose in advance. The invention effectively reduces the influence of dynamic objects on robot localization and improves the robustness of the system and its generalization across scenes.

Description

Dynamic environment RGB-D vision SLAM method based on point cloud region correlation
Technical Field
The invention relates to the field of global positioning of robots, in particular to a dynamic environment RGB-D vision SLAM method based on point cloud region correlation.
Background
Today, visual simultaneous localization and mapping (SLAM) is widely studied, and attempts are being made to apply it as the core of localization in a variety of devices, such as aids for the blind, augmented reality, robots, and autonomous driving. After roughly 20 years of development the overall framework of visual SLAM has gradually matured, and many advanced visual positioning systems with high accuracy and high robustness have been proposed. These vision-based systems can estimate the camera's own motion solely from successive images of the surrounding environment and generate a continuous trajectory representing the camera's motion. However, to simplify the estimation problem, most state-of-the-art systems are designed under the assumption that the surrounding environment is static. If dynamic objects exist in the environment, irregularly moving landmarks are introduced into the actual observations, so the system cannot distinguish whether it is moving itself or whether objects in the environment are moving, which ultimately leads to inaccurate positioning and serious deviations in map construction. To address the poor performance of visual SLAM in dynamic environments, researchers have taken a number of measures: for example, patent CN115393538A extracts features from the image information with the GCNv2 algorithm and removes dynamic feature points with an optical-flow-based dynamic point removal algorithm; patent CN114283198A detects dynamic targets in the image with the object detection network YOLOv5 and, combined with the image depth information, performs pixel-level segmentation. These existing methods are based on deep learning; however, deep-learning-based methods require prior knowledge of which objects are moving and which objects appear in the device's field of view, and the computational cost of model inference is huge, making them difficult to apply in low-cost scenarios with high real-time requirements.
At present, robot localization in dynamic environments suffers from the following defects and shortcomings: 1) unknown feature points can only be further identified and judged on the basis of pose-estimation information, and in severe environments this directly causes system failure; 2) deep-learning methods require prior knowledge of which objects are moving and which objects appear in the device's field of view, and the computational cost of model inference is huge, making them difficult to apply in low-cost scenarios with high real-time requirements; 3) geometry-based methods are prone to false negatives (static regions marked as dynamic); 4) dynamic objects are unavoidable in real environments, and their presence greatly degrades the positioning accuracy of the device, which is one of the main obstacles to the further development of visual SLAM.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a dynamic environment RGB-D visual SLAM method based on point cloud region correlation, which improves positioning accuracy and robustness in dynamic real-world environments. The invention is an independent segmentation method that does not need to take the camera motion into account and dispenses with the projective transformation most current algorithms rely on; in addition, dynamic regions can be identified without training samples or prior information about dynamic objects.
In order to achieve the above purpose, the present invention provides the following technical solution: a dynamic environment RGB-D visual SLAM method based on point cloud region correlation, comprising the following steps:
S1: performing two-dimensional plane downsampling on the RGB image and generating a point cloud by combining a depth map;
the point cloud is generated by the following specific steps:
S11: for the current frame, emulating the data-acquisition pattern of a rotating laser radar by downsampling the gray image on the two-dimensional plane and generating a point cloud, so as to accelerate construction of the KD-tree.
S12: the downsampled point cloud is rasterized with a grid of side length l; for the i-th grid cell $G_i$, the mean gray value of all points inside the cell is computed and assigned as the intensity value of the cell's center of gravity:

$$I(G_i)=\frac{1}{|G_i|}\sum_{(x,y)\in G_i} I(x,y)$$

where $|G_i|$ denotes the number of points in grid cell $G_i$ and $I(x,y)$ denotes the gray value of the pixel at coordinates $(x,y)$.
S2: dividing the point cloud generated in step S1 into sub-point clouds;
the sub-point clouds are generated by the following specific steps:
S21: removing the depth-unstable region: according to the noise model of the depth camera and empirical values obtained in experiments, the depth value of any point in the depth image is set to 0 when it exceeds a given threshold.
S22: since the point cloud obtained from the raw depth of the RGB-D camera exhibits a layered structure and holes, the point cloud obtained in S1 is smoothed using moving least squares.
S23: and clustering the point clouds by using a K-Means clustering algorithm to obtain K sub-point clouds.
S3: setting the previous frames of the current frame as reference frames;
s4: an accurate transformation matrix is obtained by adopting a method of fusion of gray level auxiliary nearest neighbor iteration and traditional nearest neighbor iteration, comprising the following steps: carrying out gray nearest neighbor iteration on each sub-point cloud of the S2 and the target point cloud of the reference frame to calculate a rough transformation matrix and a corresponding point, and then calculating an accurate transformation matrix by utilizing the conventional nearest neighbor iteration to register the point cloud;
the intensity-assisted nearest neighbor iterative registration method specifically comprises the following steps:
s41: estimating and predicting initial pose according to uniform motion model of ORBSLAM2
$\hat{T}_{\mathrm{init}}$, obtained by composing the pose-transformation increment $T_{\mathrm{pre}}$ between the last two ordinary frames with the camera pose $T_{\mathrm{last}}$ of the previous frame and the pose $T_{\mathrm{ref}}$ of the reference frame, all expressed in the world coordinate system. Providing the initial pose effectively prevents the iteration from falling into a locally optimal solution and accelerates the iteration.
S42: according to the initial pose
$\hat{T}_{\mathrm{init}}$, the $\alpha$ points of the target point cloud closest to each point of the sub-point cloud are searched as candidate corresponding points.
S43: the candidate points are scored and the candidate with the highest score $S(i,k)$ is selected as the corresponding point, where $\Delta T$ denotes the incremental transformation matrix of each iteration, $r_G(\cdot)$ the geometric residual, $r_I(\cdot)$ the intensity residual, and $w(\cdot)$ the weight conversion function.
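To make the candidate search and scoring of S42–S43 concrete, the following sketch warps the source points with the current transform estimate, retrieves the α nearest target points per source point with a KD-tree, and keeps the candidate with the best combined score. It is only an illustration: the score form (a negated weighted sum of the geometric and intensity residuals) and the weight w_int are assumptions, not the patent's exact expressions.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_correspondences(src_xyz, src_int, tgt_xyz, tgt_int, T_init, alpha=10, w_int=0.1):
    """For each source point, pick the best of its alpha nearest target candidates."""
    R, t = T_init[:3, :3], T_init[:3, 3]
    warped = src_xyz @ R.T + t                        # apply the current transform estimate
    dist, idx = cKDTree(tgt_xyz).query(warped, k=alpha)
    r_g = dist                                        # geometric residual, shape (N, alpha)
    r_i = np.abs(src_int[:, None] - tgt_int[idx])     # intensity residual, shape (N, alpha)
    score = -(r_g + w_int * r_i)                      # higher is better (assumed score form)
    best = np.argmax(score, axis=1)
    return idx[np.arange(len(warped)), best]          # chosen target index per source point
```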
S431: wherein the weight transformation function of S43 is assumed herein to obey the t-distribution, which weight transformation function can be expressed as:
a robust weighting parameterized by the degree of freedom $\nu$ of the distribution, the mean $\mu_{(*)}$ of the geometric or intensity residuals, and the corresponding variance $\sigma_{(*)}$.
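One common Student-t robust weight is sketched below for illustration; the patent's exact expression is not reproduced here, so the form (ν + 1)/(ν + ((r − μ)/σ)²) is an assumption, with ν, μ and σ corresponding to the symbols ν, μ_(*) and σ_(*) above.

```python
import numpy as np

def t_weight(residuals, nu=5.0):
    """Student-t influence function: large residuals are smoothly down-weighted."""
    r = np.asarray(residuals, dtype=np.float64)
    mu = r.mean()                 # mean of the geometric or intensity residuals
    sigma2 = r.var() + 1e-12      # their variance, guarded against zero
    return (nu + 1.0) / (nu + (r - mu) ** 2 / sigma2)
```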
S44: and S4, calculating the pairing points of the points, and calculating a gain matrix by using the pairing points to ensure that the total Euclidean distance sum is minimum:
$\Delta T=\arg\min_{\Delta T}\sum_{i}\lVert \Delta T\,p_i-q_i\rVert^{2}$, where $p_i$ and $q_i$ denote a pair of corresponding points.
S45: the transformation matrix is updated: $T^{*} \leftarrow \Delta T\, T^{*}$.
S5: transforming the source point clouds according to the transformation matrix calculated in the step S4, and estimating the degree of correlation between every two sub point clouds;
the correlation degree calculation includes:
s51: find the i-th point
of the point cloud obtained by transforming the source point cloud with the pose transformation matrix $T_{B_j}$ of the $j$-th sub-point cloud, and its closest point $P_{tgt}(j)$ in the target point cloud; the Euclidean distance between them is used as the main index and the difference of their intensity values as an additional penalty term, and the scores of all points are accumulated and averaged to give the correlation degree score.
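The sketch below shows one way such a correlation score could be computed: transform the whole source cloud with the j-th sub-pose, query each point's nearest neighbour in the target cloud, and average a Euclidean-distance term plus an intensity penalty. The penalty weight beta and the exact way the two terms are combined are illustrative assumptions rather than the patent's formula.

```python
import numpy as np
from scipy.spatial import cKDTree

def correlation_score(src_xyz, src_int, tgt_xyz, tgt_int, T_bj, beta=0.1):
    """Average per-point score of the source cloud warped by the j-th sub-pose T_bj (4x4)."""
    R, t = T_bj[:3, :3], T_bj[:3, 3]
    warped = src_xyz @ R.T + t                    # apply the j-th sub-pose to the source cloud
    dist, idx = cKDTree(tgt_xyz).query(warped, k=1)
    penalty = np.abs(src_int - tgt_int[idx])      # intensity difference as penalty term
    return float(np.mean(dist + beta * penalty))  # accumulated and averaged over all points
```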
S6: calculating a threshold divides all sub-point clouds into two groups: a dynamic correlation group and a static correlation group;
s61: a correlation degree score { R (j) } of the sub-point cloud calculated according to S5 j∈K Setting a threshold epsilon as a zone static sub-point cloud, the threshold can be determined by the following formula:
S62: according to the threshold, it can be effectively judged which sub-point clouds belong to the static point clouds: each sub-point cloud is assigned, by comparing its score with $\epsilon$, either to the static sub-point cloud set S or to the dynamic sub-point cloud set D, and all points of the dynamic sub-point cloud set are projected onto the camera plane to generate a mask.
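A sketch of the grouping step follows. The patent derives the threshold ε from the correlation scores with its own formula, which is not shown here; the mean-plus-one-standard-deviation rule and the assumption that lower scores indicate static sub-point clouds are illustrative stand-ins.

```python
import numpy as np

def split_static_dynamic(scores):
    """Split sub-point-cloud indices into static set S and dynamic set D by a score threshold."""
    scores = np.asarray(scores, dtype=np.float64)
    eps = scores.mean() + scores.std()            # assumed threshold rule, not the patent's formula
    static_idx = np.where(scores <= eps)[0]       # assumed: low score = well correlated = static
    dynamic_idx = np.where(scores > eps)[0]
    return static_idx.tolist(), dynamic_idx.tolist()
```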
S7: projecting the point cloud of the dynamic correlation group to a camera plane, drawing a circle by taking the point cloud as a center, and filling the point cloud by morphological geometry to finally form a required pixel mask;
s8: the feature points in the mask are not added into pose estimation of the camera;
s9: the processed frame is stored in the system as a reference frame and is used as an alternative frame for updating the reference frame;
s10: and refining the mask generated by the key frames by using the historical observation data to generate more stable map points.
S101, when the key frame is generated, taking the previous N frames as historical observation data, not all historical data, and estimating the pixel point weight value by the following formula:
the weight accumulates the observations $I_i(x,y)$ of the pixel at coordinates $(x,y)$ over the $N$ frames, each weighted by the pixel weight proportion $\rho$; a pixel whose accumulated weight exceeds $\lambda$ times the total observation weight is regarded as a dynamic pixel.
s102: since feature points are easily generated at the edges of the object, but these edge points are extremely unstable, the mask generated by the refined key frame will be properly expanded so that the edge points do not participate in the camera pose estimation process.
The invention can be used as front-end preprocessing for an ORB-SLAM2/3 system to filter out dynamic regions, thereby reducing erroneous data associations. The system takes the RGB image and depth map of the current frame as input and generates a point cloud from them; once a frame has been processed it is temporarily stored in the system as a reference frame and is not processed a second time. K-Means clustering segments the depth map into sub-point clouds, and the sub-pose change between each sub-point cloud and the reference frame is then computed in turn. The invention further proposes the idea of dynamic/static correlation between sub-point clouds; combining this idea with the sub-poses makes it possible to identify which sub-point clouds are dynamic without relying on epipolar geometric constraints and without knowing the approximate camera pose in advance. The invention effectively reduces the influence of dynamic objects on robot localization and improves the robustness of the system and its generalization across scenes.
Compared with the prior art, the invention has the following beneficial effects:
the disclosed dynamic environment RGB-D visual SLAM method based on point cloud region correlation removes the limitation of conventional dynamic visual SLAM, which can only judge dynamic regions by taking the camera motion into account, and it requires no training samples or prior knowledge of dynamic objects. At the same time, it improves the positioning accuracy and robustness of visual SLAM in dynamic environments.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2a to 2b are schematic diagrams of detection effects of a dynamic region according to an embodiment of the disclosure, where fig. 2a is an original image and fig. 2b is a schematic diagram of detection results of the dynamic region;
FIGS. 3 a-3 b are schematic diagrams of the present invention in comparison to a true trajectory and a trajectory of the ORB-SLAM2 algorithm on a TUM dataset, wherein FIG. 3a is a translational movement of the camera along the X-Y-Z axis, FIG. 3b is a movement of the camera along a hemisphere with a radius of 1 meter, a blue trajectory is a positioning result trajectory of the present invention, a dashed line is a movement trajectory of the true camera, and a green trajectory is a positioning result trajectory of the ORB-SLAM 2;
FIGS. 4a to 4c are diagrams illustrating the effect of key-frame refinement in an embodiment of the disclosure, wherein FIG. 4a is the sequence of historical observation data frames, FIG. 4b is the original key frame before refinement, and FIG. 4c is the key frame after refinement;
fig. 5a to 5b are schematic views of the dynamic region detection effect under an actual scene according to the disclosed embodiment of the invention, wherein fig. 5a is an original image, and fig. 5b is a schematic view of the dynamic region detection result.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings. It should be understood that the examples are presented for the purpose of illustrating the invention only and are not intended to limit the invention.
The flow of the dynamic environment RGB-D visual SLAM method based on point cloud region correlation disclosed by the invention is shown in FIG. 1 and comprises the following steps.
S1: performing two-dimensional plane downsampling on the RGB image and generating a point cloud by combining a depth map;
the point cloud is generated by the following specific steps:
S11: for the current frame, emulating the data-acquisition pattern of a rotating laser radar: the gray image is sampled every 4 pixels horizontally and vertically and combined with the depth image to generate a point cloud, thereby downsampling on the two-dimensional plane and accelerating construction of the KD-tree.
S12: the downsampled point cloud is rasterized with a grid of side length 0.05; for the i-th grid cell $G_i$, the mean gray value of all points inside the cell is computed and taken as the intensity value of the cell's center of gravity:

$$I(G_i)=\frac{1}{|G_i|}\sum_{(x,y)\in G_i} I(x,y)$$

where $|G_i|$ denotes the number of points in grid cell $G_i$ and $I(x,y)$ denotes the gray value of the pixel at coordinates $(x,y)$.
S2: dividing the point cloud generated in the step S1 into sub-point clouds;
the sub-point clouds are generated by the following specific steps:
S21: removing the depth-unstable region: according to the noise model of the depth camera and empirical values obtained in experiments, the depth value of any point in the depth image is set to 0 when it exceeds 6 meters.
S22: since the point cloud obtained from the raw depth of the RGB-D camera exhibits a layered structure and holes, the point cloud obtained in S1 is smoothed using moving least squares.
S23: and clustering the point clouds by using a K-Means clustering algorithm to obtain K=10 sub-point clouds.
S3: setting the 3 frames preceding the current frame as reference frames;
S4: an accurate transformation matrix is obtained by fusing gray-level-assisted nearest-neighbor iteration with conventional nearest-neighbor iteration: for each sub-point cloud of S2 and the target point cloud of the reference frame, 5 gray-level nearest-neighbor iterations are first performed to calculate a coarse transformation matrix and corresponding points, and 5 conventional nearest-neighbor iterations are then performed to calculate the accurate transformation matrix and register the point clouds;
the intensity-assisted nearest neighbor iterative registration method specifically comprises the following steps:
s41: estimating and predicting initial pose according to uniform motion model of ORBSLAM2
$\hat{T}_{\mathrm{init}}$, obtained by composing the pose-transformation increment $T_{\mathrm{pre}}$ between the last two ordinary frames with the camera pose $T_{\mathrm{last}}$ of the previous frame and the pose $T_{\mathrm{ref}}$ of the reference frame, all expressed in the world coordinate system. Providing the initial pose effectively prevents the iteration from falling into a locally optimal solution and accelerates the iteration.
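The sketch below shows one way the constant-velocity initial guess of S41 could be assembled from the quantities named above (T_pre, T_last, T_ref as 4×4 homogeneous matrices in world coordinates). The composition order is an assumption made for illustration; only the idea of predicting the current pose and expressing it relative to the reference frame is taken from the text.

```python
import numpy as np

def predict_initial_pose(T_pre, T_last, T_ref):
    """Constant-velocity guess: predict the current pose, then express it relative
    to the reference frame (composition order assumed)."""
    T_cur_pred = T_pre @ T_last                   # predicted current camera pose (4x4 SE(3))
    return np.linalg.inv(T_ref) @ T_cur_pred      # initial transform for registering to the reference frame
```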
S42: according to the initial pose
$\hat{T}_{\mathrm{init}}$, the $\alpha = 10$ points of the target point cloud closest to each point of the sub-point cloud are searched as candidate corresponding points.
S43: the candidate points are scored and the candidate with the highest score $S(i,k)$ is selected as the corresponding point, where $\Delta T$ denotes the incremental transformation matrix of each iteration, $r_G(\cdot)$ the geometric residual, $r_I(\cdot)$ the intensity residual, and $w(\cdot)$ the weight conversion function.
S431: wherein the weight transformation function of S43 is assumed herein to obey the t-distribution, which weight transformation function can be expressed as:
a robust weighting parameterized by the degree of freedom $\nu$ of the distribution (set to 5 in the experiments), the mean $\mu_{(*)}$ of the geometric or intensity residuals, and the corresponding variance $\sigma_{(*)}$.
S44: and S4, calculating the pairing points of the points, and calculating a gain matrix by using the pairing points to ensure that the total Euclidean distance sum is minimum:
$\Delta T=\arg\min_{\Delta T}\sum_{i}\lVert \Delta T\,p_i-q_i\rVert^{2}$, where $p_i$ and $q_i$ denote a pair of corresponding points.
S45: the transformation matrix is updated: $T^{*} \leftarrow \Delta T\, T^{*}$.
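For the minimization in S44–S45, the rigid increment that minimizes the summed squared Euclidean distance over matched pairs has the standard SVD (Kabsch) closed form sketched below, after which it is left-multiplied onto the running transform as in S45. Whether the patent uses exactly this closed form is not stated; the code is an illustrative assumption.

```python
import numpy as np

def best_rigid_increment(src, tgt):
    """Return the 4x4 transform dT minimizing sum ||dT @ src_i - tgt_i||^2 over matched pairs."""
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - mu_s).T @ (tgt - mu_t)             # cross-covariance of the centred pairs
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    dT = np.eye(4)
    dT[:3, :3] = R
    dT[:3, 3] = mu_t - R @ mu_s
    return dT

# S45: T_star = best_rigid_increment(matched_src, matched_tgt) @ T_star
```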
S5: transforming the source point clouds according to the transformation matrix calculated in the step S4, and estimating the degree of correlation between every two sub point clouds;
the correlation degree calculation includes:
find the i-th point
of the point cloud obtained by transforming the source point cloud with the pose transformation matrix of the $j$-th sub-point cloud, and its closest point $P_{tgt}(j)$ in the target point cloud; the Euclidean distance between them is used as the main index and the difference of their intensity values as an additional penalty term, and the scores of all points are accumulated and averaged to give the correlation degree score.
S6: calculating a threshold divides all sub-point clouds into two groups: a dynamic correlation group and a static correlation group;
s61: a correlation degree score { R (j) } of the sub-point cloud calculated according to S5 j∈K Setting a threshold epsilon as a zone static sub-point cloud, the threshold can be determined by the following formula:
S62: according to the threshold, it can be effectively judged which sub-point clouds belong to the static point clouds: each sub-point cloud is assigned, by comparing its score with $\epsilon$, either to the static sub-point cloud set S or to the dynamic sub-point cloud set D, and all points of the dynamic sub-point cloud set are projected onto the camera plane to generate a mask.
S7: projecting the point cloud of the dynamic correlation group to a camera plane, drawing a circle by taking the point cloud as a center, and filling the point cloud by morphological geometry to finally form a required pixel mask;
s8: the feature points in the mask are not added into pose estimation of the camera;
s9: the processed frames are stored in a system and used as alternative frames updated by reference frames;
s10: and refining the mask generated by the key frames by using the historical observation data to generate more stable map points.
S101, when the key frame is generated, the first n=7 frames are taken as history observation data, not all history data, and the pixel weight is estimated by the following formula:
the weight accumulates the observations $I_i(x,y)$ of the pixel at coordinates $(x,y)$ over the $N$ frames, each weighted by the pixel weight proportion $\rho$; in the experiments the weight $\rho$ of a key frame is 3, the weight $\rho$ of an ordinary frame is 2, and $N$ is set to 7. A pixel whose accumulated weight exceeds $\lambda = 0.6$ times the total observation weight is regarded as a dynamic pixel.
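The weighted voting of S101 can be sketched as follows with the embodiment's values (N = 7, ρ = 3 for key frames, ρ = 2 for ordinary frames, λ = 0.6). The exact voting formula is not reproduced by the source, so this weighted-sum form is an assumption; the dilation of S102 would be applied to the returned mask afterwards.

```python
import numpy as np

def refine_keyframe_mask(history_masks, is_keyframe, lam=0.6):
    """history_masks: list of HxW boolean dynamic masks from the previous N = 7 frames."""
    weights = [3.0 if kf else 2.0 for kf in is_keyframe]   # rho = 3 for key frames, 2 for ordinary frames
    votes = sum(w * m.astype(np.float32) for w, m in zip(weights, history_masks))
    # a pixel stays dynamic only if its weighted vote exceeds lambda of the total weight
    return votes > lam * sum(weights)
```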
s102: since feature points are easily generated at the edges of the object, but these edge points are extremely unstable, the mask generated by the refined key frame will be properly expanded so that the edge points do not participate in the camera pose estimation process.
FIGS. 2a and 2b show dynamic-region identification for a pedestrian on the TUM dataset; the dynamic region is well identified by the invention and is little affected by camera rotation. FIGS. 3a and 3b compare the absolute positioning accuracy of the invention with the ground-truth trajectory and the trajectory of the ORB-SLAM2 algorithm: the trajectory of the invention closely fits the ground truth, whereas the trajectory of ORB-SLAM2 deviates severely, which demonstrates that the invention improves the positioning accuracy and robustness of a SLAM system in dynamic environments. FIGS. 4a to 4c visualize the key-frame refinement: the original key frame before refinement in FIG. 4b is filtered by the sequence of historical observation data frames in FIG. 4a, finally yielding the refined key frame in FIG. 4c. Since the mask area of the refined key frame is greatly reduced, the influence of false negatives is effectively reduced and the robustness of the system is improved. FIGS. 5a and 5b visualize dynamic-region identification in a real environment, which fully demonstrates the reliability and practicality of the invention.
It should be emphasized that the embodiments described herein are illustrative rather than limiting; the invention is not restricted to the specific embodiments described herein, and other embodiments derived by those skilled in the art from the teachings of the invention likewise fall within its scope of protection.

Claims (8)

1. The dynamic environment RGB-D visual SLAM method based on point cloud region correlation is characterized by comprising the following steps:
S1: performing two-dimensional plane downsampling on the RGB image and generating a point cloud by combining a depth map;
S2: dividing the point cloud generated in the step S1 into sub-point clouds;
S3: setting the frames preceding the current frame as reference frames;
S4: obtaining an accurate transformation matrix by fusing gray-level-assisted nearest-neighbor iteration with conventional nearest-neighbor iteration, comprising: performing gray-level nearest-neighbor iterations on each sub-point cloud of S2 and the target point cloud of the reference frame to calculate a coarse transformation matrix and corresponding points, and then calculating an accurate transformation matrix by conventional nearest-neighbor iterations to register the point clouds;
S5: transforming the source point clouds according to the transformation matrix calculated in the step S4, and estimating the degree of correlation between every two sub-point clouds;
S6: calculating a threshold that divides all sub-point clouds into two groups: a dynamic correlation group and a static correlation group;
S7: projecting the points of the dynamic correlation group onto the camera plane, drawing a circle centered on each projected point, and filling the result with morphological operations to finally form the required pixel mask;
S8: the feature points in the mask are not added into the pose estimation of the camera;
S9: storing the processed frame in the system as a candidate frame for updating the reference frame;
S10: refining the mask generated for key frames by using the historical observation data to generate more stable map points.
2. The dynamic environment RGB-D visual SLAM method based on point cloud region correlation of claim 1, wherein the point cloud generation in step S1 comprises:
S11: for the current frame, emulating the data-acquisition pattern of a rotating laser radar by downsampling the gray image on the two-dimensional plane and generating a point cloud, so as to accelerate construction of the KD-tree;
S12: rasterizing the downsampled point cloud with a grid of side length l, and for the i-th grid cell $G_i$ computing the mean gray value of all points inside the cell and taking it as the intensity value of the cell's center of gravity:

$$I(G_i)=\frac{1}{|G_i|}\sum_{(x,y)\in G_i} I(x,y)$$

where $|G_i|$ denotes the number of points in grid cell $G_i$ and $I(x,y)$ denotes the gray value of the pixel at coordinates $(x,y)$.
3. The dynamic environment RGB-D visual SLAM method based on point cloud region correlation of claim 1, wherein the generating of the sub-point clouds in step S2 comprises:
S21: removing the depth-unstable region: according to the noise model of the depth camera and empirical values obtained in experiments, setting the depth value of any point in the depth image to 0 when it exceeds a given threshold;
S22: since the point cloud obtained from the raw depth of the RGB-D camera exhibits a layered structure and holes, smoothing the point cloud obtained in S1 using moving least squares;
S23: clustering the point cloud with a K-Means clustering algorithm to obtain K sub-point clouds.
4. The dynamic environment RGB-D visual SLAM method based on point cloud region correlation of claim 1, wherein the gray-level-assisted nearest-neighbor iterative computation of the transformation matrix in step S4 comprises:
S41: estimating a predicted initial pose $\hat{T}_{\mathrm{init}}$ according to the uniform-motion model of ORB-SLAM2, obtained by composing the pose-transformation increment $T_{\mathrm{pre}}$ between the last two ordinary frames with the camera pose $T_{\mathrm{last}}$ of the previous frame and the pose $T_{\mathrm{ref}}$ of the reference frame, all expressed in the world coordinate system; providing the initial pose effectively prevents the iteration from falling into a locally optimal solution and accelerates the iteration;
S42: according to the initial pose $\hat{T}_{\mathrm{init}}$, searching for each point of the sub-point cloud the $\alpha$ points of the target point cloud closest to it as candidate corresponding points;
S43: scoring the candidate points and selecting the candidate with the highest score $S(i,k)$ as the corresponding point, wherein $\Delta T$ denotes the incremental transformation matrix of each iteration, $r_G(\cdot)$ the geometric residual, $r_I(\cdot)$ the intensity residual, and $w(\cdot)$ the weight conversion function;
S431: wherein the weight conversion function of S43 is assumed to obey a t-distribution and is parameterized by the degree of freedom $\nu$ of the distribution, the mean $\mu_{(*)}$ of the geometric or intensity residuals, and the corresponding variance $\sigma_{(*)}$;
S44: using the point pairings computed above, calculating the incremental matrix that minimizes the total sum of Euclidean distances between the paired points;
S45: updating the transformation matrix: $T^{*} \leftarrow \Delta T\, T^{*}$.
5. The dynamic environment RGB-D visual SLAM method based on point cloud region correlation of claim 1, wherein the calculating of the degree of correlation between sub-point clouds in step S5 comprises:
finding, for the $i$-th point of the point cloud obtained by transforming the source point cloud with the pose transformation matrix of the $j$-th sub-point cloud, its closest point $P_{tgt}(j)$ in the target point cloud; using the Euclidean distance between them as the main index and the difference of their intensity values as an additional penalty term; and accumulating and averaging the scores of all points to obtain the correlation degree score.
6. The dynamic environment RGB-D visual SLAM method based on point cloud region correlation of claim 1, wherein the calculating of the threshold that divides the dynamic and static correlation groups in step S6 comprises:
S61: from the correlation degree scores $\{R(j)\}_{j\in K}$ of the sub-point clouds calculated in S5, setting a threshold $\epsilon$ for separating the static sub-point clouds, the threshold being determined from the distribution of these scores;
S62: judging according to the threshold which sub-point clouds belong to the static point clouds: each sub-point cloud is assigned, by comparing its score with $\epsilon$, either to the static sub-point cloud set S or to the dynamic sub-point cloud set D, and all points belonging to the set D are projected onto the camera plane to generate a mask.
7. The method of claim 1, wherein in step S7, all the points of the dynamic correlation group are projected onto the camera plane, a circle is drawn centered on each projected point, and the result is then filled with morphological operations to finally form the required pixel mask.
8. The dynamic environment RGB-D visual SLAM method based on point cloud region correlation of claim 1, wherein step S10 comprises:
S101: when a key frame is generated, taking the previous N frames (rather than all historical data) as historical observation data and estimating a weight for each pixel, wherein $I_i(x,y)$ denotes the observation of the pixel at coordinates $(x,y)$ in the $i$-th data frame and $\rho$ is the weight proportion of that pixel; a pixel whose accumulated weight exceeds $\lambda$ times the total observation weight is regarded as a dynamic pixel;
S102: since feature points are easily generated at object edges, yet such edge points are unstable, the mask generated for the refined key frame is dilated so that edge points do not participate in the camera pose estimation.
CN202310176750.3A 2023-02-28 2023-02-28 Dynamic environment RGB-D vision SLAM method based on point cloud region correlation Pending CN116299525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310176750.3A CN116299525A (en) 2023-02-28 2023-02-28 Dynamic environment RGB-D vision SLAM method based on point cloud region correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310176750.3A CN116299525A (en) 2023-02-28 2023-02-28 Dynamic environment RGB-D vision SLAM method based on point cloud region correlation

Publications (1)

Publication Number Publication Date
CN116299525A true CN116299525A (en) 2023-06-23

Family

ID=86802418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310176750.3A Pending CN116299525A (en) 2023-02-28 2023-02-28 Dynamic environment RGB-D vision SLAM method based on point cloud region correlation

Country Status (1)

Country Link
CN (1) CN116299525A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117190900A (en) * 2023-11-07 2023-12-08 中铁八局集团第二工程有限公司 Tunnel surrounding rock deformation monitoring method
CN117190900B (en) * 2023-11-07 2024-01-02 中铁八局集团第二工程有限公司 Tunnel surrounding rock deformation monitoring method

Similar Documents

Publication Publication Date Title
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
CN110097553B (en) Semantic mapping system based on instant positioning mapping and three-dimensional semantic segmentation
CN111325843B (en) Real-time semantic map construction method based on semantic inverse depth filtering
CN110232350B (en) Real-time water surface multi-moving-object detection and tracking method based on online learning
CN110363816B (en) Mobile robot environment semantic mapping method based on deep learning
CN112396595B (en) Semantic SLAM method based on point-line characteristics in dynamic environment
CN111899334A (en) Visual synchronous positioning and map building method and device based on point-line characteristics
CN113506318B (en) Three-dimensional target perception method under vehicle-mounted edge scene
CN113570629B (en) Semantic segmentation method and system for removing dynamic objects
CN113345008A (en) Laser radar dynamic obstacle detection method considering wheel type robot position and posture estimation
CN104715251A (en) Salient object detection method based on histogram linear fitting
CN111998862A (en) Dense binocular SLAM method based on BNN
CN111914832B (en) SLAM method of RGB-D camera under dynamic scene
CN114549549B (en) Dynamic target modeling tracking method based on instance segmentation in dynamic environment
CN113406659A (en) Mobile robot position re-identification method based on laser radar information
CN116299525A (en) Dynamic environment RGB-D vision SLAM method based on point cloud region correlation
Liu et al. KMOP-vSLAM: Dynamic visual SLAM for RGB-D cameras using K-means and OpenPose
CN113689459B (en) Real-time tracking and mapping method based on GMM and YOLO under dynamic environment
CN113487631A (en) Adjustable large-angle detection sensing and control method based on LEGO-LOAM
CN117496401A (en) Full-automatic identification and tracking method for oval target points of video measurement image sequences
CN116429126A (en) Quick repositioning method for ground vehicle point cloud map in dynamic environment
CN111239761B (en) Method for indoor real-time establishment of two-dimensional map
CN112446885A (en) SLAM method based on improved semantic optical flow method in dynamic environment
CN116363218B (en) Lightweight visual SLAM method suitable for dynamic environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination