CN112381841A - Semantic SLAM method based on GMS feature matching in dynamic scene - Google Patents


Info

Publication number
CN112381841A
Authority
CN
China
Prior art keywords
dynamic
point
matching
feature
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011365138.3A
Other languages
Chinese (zh)
Inventor
陈政
游林辉
胡峰
孙仝
张谨立
宋海龙
黄达文
***
梁铭聪
黄志就
何彧
陈景尚
谭子毅
潘嘉琪
李志鹏
罗鲜林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202011365138.3A
Publication of CN112381841A

Classifications

    • G06T7/11 Region-based segmentation
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

    (Hierarchy: G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; the G06T7/xx codes fall under G06T7/00 Image analysis, and the G06T2207/xx codes under the G06T2207/00 indexing scheme for image analysis or image enhancement.)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a semantic SLAM method based on GMS feature matching in a dynamic scene, which comprises: segmenting the input image through a semantic segmentation network to obtain the masks of all objects, and removing the masks of dynamic objects to obtain a preliminary image with the dynamic objects removed; extracting ORB feature points from the input image and then computing their descriptors; and detecting and eliminating dynamic feature points by a method combining motion consistency and semantic information. By removing dynamic features through this combination of motion consistency and semantic information, the operating accuracy and robustness of the visual SLAM system in a highly dynamic environment are improved.

Description

Semantic SLAM method based on GMS feature matching in dynamic scene
Technical Field
The invention relates to the field of vision-based positioning and navigation for autonomous inspection by unmanned aerial vehicles, and in particular to a semantic SLAM method based on GMS feature matching in a dynamic scene.
Background
Intelligent inspection requires an unmanned aerial vehicle to autonomously decide its next operation from real-time information about the current environment. Real-time positioning of the unmanned aerial vehicle and mapping of its working environment are therefore key links in the intelligent inspection process. Especially when multiple unmanned aerial vehicles cooperate in a gridded arrangement, the environment observed by each vehicle is a dynamic scene (containing occasionally moving objects), so a dedicated algorithm must be developed for dynamic scenes during positioning and environment mapping.
Simultaneous Localization and Mapping (SLAM) is a technology that estimates the current position and attitude through sensors and a corresponding motion estimation algorithm, and builds a three-dimensional map of the environment without any prior environmental information. With the development of computer vision and deep learning and the improvement of hardware computing power, vision-based SLAM research has deepened continuously and is widely applied in fields such as autonomous driving, mobile robots, and unmanned aerial vehicles.
Chinese patent application publication No. CN110322511A, published on October 11, 2019, discloses a semantic SLAM method and system based on object and plane features: an RGB-D image stream of the scene is acquired and tracked frame by frame to obtain key frame images; a local map of the scene is built from the key frames; plane segmentation of the key frame depth maps yields the current planes, from which a global plane map is built; object detection on the key frames yields detection boxes and confidences, from which the object point clouds are reconstructed and the feature points inside each box are merged into the object to obtain a global object map; loop detection on the key frames yields loop frames, and loop correction optimizes the plane and object constraints to obtain the plane map and object map of the scene. That invention can improve SLAM optimization performance and enhance the semantic description of the environment.
However, the application scene of the above method is essentially static, whereas in practical applications the dynamic information in the environment is not negligible. A SLAM method lacking a mechanism for handling dynamic scenes can suffer a series of problems such as initialization failure, excessive positioning error, and mapping failure, resulting in low operating accuracy and poor robustness in dynamic scenes.
Disclosure of Invention
In order to solve the problems of low accuracy and poor robustness of prior-art SLAM methods, the invention provides a semantic SLAM method based on GMS feature matching in a dynamic scene, which improves the accuracy and robustness of a visual SLAM system operating in a dynamic environment by combining motion consistency and semantic information to remove dynamic features.
In order to solve the technical problems, the invention adopts the technical scheme that: a semantic SLAM method based on GMS feature matching in a dynamic scene comprises the following steps:
the method comprises the following steps: calibrating a camera, removing image distortion, acquiring and inputting an environment image;
step two: segmenting the input image through a semantic segmentation network to obtain the masks of all objects, and removing the masks of the dynamic objects to obtain a preliminary image with the dynamic objects removed;
step three: extracting ORB feature points from the input image, and then computing descriptors;
step four: detecting and eliminating dynamic feature points according to a method combining motion consistency and semantic information;
step five: performing feature point matching by using a GMS algorithm to remove mismatching;
step six: obtaining the camera pose through the tracking thread;
step seven: performing point cloud processing through a local mapping process to obtain a sparse point cloud map;
step eight: and optimizing the pose by using loop detection and correcting drift errors.
Preferably, in step one, calibrating the camera and removing image distortion specifically comprises:
s1.1: first, obtain the camera intrinsic parameters fx, fy, cx, cy, and normalize the three-dimensional coordinates (X, Y, Z) to the normalized plane coordinates (x, y), with x = X/Z and y = Y/Z;
s1.2: remove the effect of lens distortion on the image, where [k1, k2, k3, p1, p2] are the lens distortion coefficients and r is the distance of the point from the origin of the normalized coordinate system:

x' = x(1 + k1 r^2 + k2 r^4 + k3 r^6) + 2 p1 x y + p2 (r^2 + 2 x^2)
y' = y(1 + k1 r^2 + k2 r^4 + k3 r^6) + p1 (r^2 + 2 y^2) + 2 p2 x y
s1.3: transfer the coordinates from the camera coordinate system to the pixel coordinate system (a code sketch of steps S1.1-S1.3 follows):

u = fx x' + cx, v = fy y' + cy
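For illustration, a minimal sketch of this calibration-and-undistortion step using OpenCV is given below; the intrinsic and distortion values and the file name are placeholders rather than parameters from the patent, and note that OpenCV orders the distortion coefficients as [k1, k2, p1, p2, k3]:

```python
import cv2
import numpy as np

# Intrinsic matrix assembled from fx, fy, cx, cy (placeholder values).
K = np.array([[520.9, 0.0, 325.1],
              [0.0, 521.0, 249.7],
              [0.0, 0.0, 1.0]])
# Lens distortion coefficients; OpenCV's order is [k1, k2, p1, p2, k3].
dist = np.array([0.26, -0.95, -0.0054, 0.0027, 1.16])

img = cv2.imread("frame.png")
# Removes lens distortion (S1.2) and re-projects to pixel coordinates (S1.3).
undistorted = cv2.undistort(img, K, dist)
```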
preferably, the semantic segmentation network is a lightweight semantic segmentation network FcHarDnet. The model size is reduced by convolution of HDB block connection 1x1, and the network has image processing speed about 30% higher than other network structures under the same hardware environment. Potential dynamic regions of the image are segmented by FcHarDnet and a mask is generated.
Preferably, in step three, a Gaussian pyramid is constructed in the process of extracting ORB feature points from the input image, and feature point detection is performed on different layers of the pyramid, so as to achieve scale invariance.
Preferably, the specific steps of ORB feature point extraction are as follows:
s3.1: when the difference between the gray value I(x) of more than N points on a circle around a point P and the gray value I(P) of P is greater than the threshold ε, the point is considered a target corner point, specifically expressed as:

N = Σ_{x∈circle(P)} 1( |I(x) − I(P)| > ε )

where 1(·) equals 1 when the condition holds and 0 otherwise;
s3.2: to make the feature points rotation-invariant, the intensity centroid of the circular patch around the key point is calculated:

m_pq = Σ_{x,y} x^p y^q I(x, y), p, q ∈ {0, 1}
C = ( m10/m00 , m01/m00 )

where m00, m10, m01 are the gray-scale moments of the circular area around the key point, and the orientation is taken as the angle θ = arctan(m01/m10) from the key point to the centroid C;
s3.3: the image pyramid is constructed by scaling the original image sequence by a fixed ratio, and corresponding corner points are extracted from the images of different sizes at each level of the pyramid;
s3.4: a quadtree homogenization algorithm is adopted to repeatedly subdivide the image into four equal-sized child nodes; nodes containing no feature points are removed, and nodes containing more than one feature point are split further; after node allocation is finished, redundant feature points in the child nodes are deleted. A code sketch of the ORB extraction follows.
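The following sketch shows the ORB extraction of step three with OpenCV; the parameter values are common defaults rather than values prescribed by the patent, and the quadtree homogenization of S3.4 is not part of stock OpenCV and would need a custom pass:

```python
import cv2

# nlevels/scaleFactor control the image pyramid (scale invariance, S3.3);
# fastThreshold is the FAST corner threshold of S3.1.
orb = cv2.ORB_create(nfeatures=1000, scaleFactor=1.2, nlevels=8, fastThreshold=20)

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
keypoints = orb.detect(gray, None)                     # oriented FAST corners (S3.1-S3.2)
keypoints, descriptors = orb.compute(gray, keypoints)  # rotation-aware BRIEF descriptors
```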
Preferably, in step four, motion consistency detection is as follows: a static point satisfies the epipolar geometric constraint, i.e., the matching point of a static object feature re-projected onto the reference frame must lie on the intersection line of the reference frame image plane and the epipolar plane (the epipolar line); the specific steps are:
S4.1: p1, p2 are the homogeneous coordinates of the matching points in the current frame and the reference frame, where u, v are pixel coordinates in the image, specifically:

p1 = [u1, v1, 1], p2 = [u2, v2, 1]
s4.2: compute the epipolar line, where F is the fundamental matrix, obtainable from eight pairs of feature points, specifically:

[X, Y, Z]^T = F p1^T

where [X, Y, Z]^T represents the epipolar line vector;
s4.3: compute the distance D from the matching point to the corresponding epipolar line; for a static point D approaches 0, and when D is greater than a threshold ε the feature point is dynamic, specifically:

D = |p2 F p1^T| / sqrt(X^2 + Y^2)
the semantic segmentation and the dynamic consistency check are carried out on the image through the steps, but whether the object is dynamic or not cannot be accurately judged from the segmentation result. Whether the characteristic points are dynamic or not can be detected through dynamic consistency, but accurate outline information of the object is not available. The dynamic point determination method comprises the following steps: if there are a sufficient number of dynamic points detected by motion consistency within the mask of the object semantic segmentation, all points of the object are considered dynamic, and all points within the whole object mask are eliminated.
Preferably, in step five, feature matching is performed on the feature points remaining after the dynamic points are removed in step four. The Grid-based Motion Statistics (GMS) algorithm rests on an assumption of motion smoothness: smooth motion means that the region around a true match moves consistently across the two images, so a correct match is supported by other correct matches around it, whereas a false match is an isolated occurrence with little or no supporting matches in its neighborhood.
The specific steps of GMS mismatch elimination are as follows:
s5.1: the image is divided into 20 x 20 non-overlapping grid cells; to handle features lying on cell boundaries, the cell width and height can be shifted and the computation iterated.
S5.2: dividing 3 x 3 pixels around each feature point into a unit, and calculating the sum S of the total matching number of each unit { i, j } and the corresponding neighbor (9 grids) of the reference frame in the neighborij
Figure BDA0002805185600000043
Wherein the content of the first and second substances,
Figure BDA0002805185600000051
representing the number of matched feature points on the corresponding grid pair;
s5.3: compute the threshold that divides each cell into correct matches and mismatches, where n_i is the average number of feature points in each grid cell and α is an empirical value of 5; the calculation formula is:

τ_i = α · sqrt(n_i)
s5.4: repeat steps S5.2 and S5.3 until the image traversal is complete, obtaining the correct matches with the reference frame (a code sketch follows).
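A sketch of steps S5.1-S5.4 using the GMS implementation shipped with opencv-contrib follows; the file names are placeholders, and thresholdFactor plays the role of α (OpenCV defaults to 6, whereas the text above uses 5):

```python
import cv2

img1 = cv2.imread("ref.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("cur.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(10000)  # GMS relies on a large number of features
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
raw = cv2.BFMatcher(cv2.NORM_HAMMING).match(des1, des2)

# Size arguments are (width, height); grid statistics reject unsupported matches.
good = cv2.xfeatures2d.matchGMS((img1.shape[1], img1.shape[0]),
                                (img2.shape[1], img2.shape[0]),
                                kp1, kp2, raw,
                                withRotation=False, withScale=False,
                                thresholdFactor=6)
```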
Preferably, in step six, the tracking thread specifically: estimates the pose of the current frame from the matches of step five according to a uniform (constant-velocity) motion model, tracks the map points of the previous frame, and determines the pose.
Preferably, point cloud processing is performed through the local mapping thread, and a sparse point cloud map is obtained through local BA (bundle adjustment) optimization.
Preferably, in step eight, a bag-of-words model is used to judge whether a loop has been produced; if so, the accumulated error is corrected through the loop to optimize the pose, as in the sketch below.
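For illustration only, a toy bag-of-words loop check is sketched below; practical systems typically use a DBoW2-style vocabulary tree, and the vocabulary size and similarity threshold here are arbitrary assumptions:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(train_descriptors, k=500):
    """Cluster stacked ORB descriptors from many frames into k visual words."""
    return MiniBatchKMeans(n_clusters=k).fit(train_descriptors.astype(np.float32))

def bow_vector(vocab, descriptors):
    """Normalized histogram of visual-word occurrences for one frame."""
    words = vocab.predict(descriptors.astype(np.float32))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-9)

def is_loop(vec_a, vec_b, threshold=0.8):
    return float(vec_a @ vec_b) > threshold  # cosine similarity of BoW vectors
```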
Compared with the prior art, the invention has the beneficial effects that:
1) the invention improves the operating accuracy and robustness of the visual SLAM system in a highly dynamic environment by combining motion consistency and semantic information to remove dynamic features.
2) aiming mainly at the feature matching problem, the invention provides a grid-based motion-statistics method that can quickly eliminate wrong matches and thereby improve matching stability.
Drawings
Fig. 1 is a flowchart of a semantic SLAM method based on GMS feature matching in a dynamic scenario according to the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there are terms such as "upper", "lower", "left", "right", "long", "short", etc., indicating orientations or positional relationships based on the orientations or positional relationships shown in the drawings, it is only for convenience of description and simplicity of description, but does not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationships in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.
The technical scheme of the invention is further described in detail by the following specific embodiments in combination with the attached drawings:
Embodiment
Fig. 1 shows an embodiment of a semantic SLAM method based on GMS feature matching in a dynamic scenario, which includes the following steps:
the method comprises the following steps. Step one: calibrate the camera and remove image distortion; acquire and input an environment image. The specific steps of calibrating the camera and removing image distortion are:
s1.1: first, obtain the camera intrinsic parameters fx, fy, cx, cy, and normalize the three-dimensional coordinates (X, Y, Z) to the normalized plane coordinates (x, y), with x = X/Z and y = Y/Z;
s1.2: remove the effect of lens distortion on the image, where [k1, k2, k3, p1, p2] are the lens distortion coefficients and r is the distance of the point from the origin of the normalized coordinate system:

x' = x(1 + k1 r^2 + k2 r^4 + k3 r^6) + 2 p1 x y + p2 (r^2 + 2 x^2)
y' = y(1 + k1 r^2 + k2 r^4 + k3 r^6) + p1 (r^2 + 2 y^2) + 2 p2 x y
s1.3: transfer the coordinates from the camera coordinate system to the pixel coordinate system:

u = fx x' + cx, v = fy y' + cy
Step two: segment the input image through the semantic segmentation network to obtain the masks of all objects and achieve a preliminary dynamic segmentation. The semantic segmentation network is the lightweight network FcHarDnet: the model size is reduced by HDB blocks connected through 1x1 convolutions, and under the same hardware environment the network processes images about 30% faster than other network structures. The potentially dynamic regions of the image are segmented by FcHarDnet and a mask is generated.
Step three: extract ORB feature points from the input image, then compute their descriptors. The specific steps of ORB feature point extraction are as follows:
s3.1: when the difference between the gray values of a sufficient number of points on a circle around a point P and the gray value I(P) of P is large, the point is regarded as a target corner point, specifically expressed as:

N = Σ_{x∈circle(P)} 1( |I(x) − I(P)| > ε )
s3.2: to make the feature points rotation-invariant, the intensity centroid is calculated:

m_pq = Σ_{x,y} x^p y^q I(x, y), C = ( m10/m00 , m01/m00 )

where m00, m10, m01 are gray-scale moments of the circular area around the key point;
s3.3: the image pyramid is constructed by scaling the original image sequence by a fixed ratio; corresponding corner points are extracted from the images of different sizes at each level of the image pyramid;
s3.4: a quadtree homogenization algorithm is adopted to repeatedly subdivide the image into four equal-sized child nodes; nodes containing no feature points are removed, and nodes containing more than one feature point are split further; after node allocation is finished, redundant feature points in the child nodes are deleted.
Step four: detect and eliminate dynamic feature points by the method combining motion consistency and semantic information. Motion consistency detection is as follows: a static point satisfies the epipolar geometric constraint, i.e., the matching point of a static feature re-projected onto the reference frame must lie on the intersection line of the reference frame image plane and the epipolar plane (the epipolar line); the specific steps are:
S4.1: p1, p2 are the homogeneous coordinates of the matching points in the current frame and the reference frame, where u, v are pixel coordinates in the image, specifically:

p1 = [u1, v1, 1], p2 = [u2, v2, 1]
s4.2: compute the epipolar line, where F is the fundamental matrix, obtainable from eight pairs of feature points, specifically:

[X, Y, Z]^T = F p1^T

where [X, Y, Z]^T represents the epipolar line vector;
s4.3: compute the distance D from the matching point to the corresponding epipolar line; for a static point D approaches 0, and when D is greater than a threshold ε the feature point is dynamic, specifically:

D = |p2 F p1^T| / sqrt(X^2 + Y^2)
the dynamic point determination method comprises the following steps: if there are a sufficient number of dynamic points detected by motion consistency within the mask of the object semantic segmentation, all points of the object are considered dynamic, and all points within the whole object mask are eliminated.
The feature points in adjacent frames are matched by a fast nearest-neighbor method: the Hamming distance between feature points is computed, matching is performed according to their degree of similarity, and mismatched feature points are removed with the PROSAC algorithm, as in the sketch below.
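A sketch of this matching step follows; since PROSAC is not exposed as a standalone routine in OpenCV, the USAC_PROSAC flag of findFundamentalMat (OpenCV 4.5 or later) stands in for it, and the ratio-test constant 0.75 is a conventional choice, not a value from the patent:

```python
import cv2
import numpy as np

img_prev = cv2.imread("prev.png", cv2.IMREAD_GRAYSCALE)
img_cur = cv2.imread("cur.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(2000)
kp_prev, des_prev = orb.detectAndCompute(img_prev, None)
kp_cur, des_cur = orb.detectAndCompute(img_cur, None)

# Hamming-distance matching between adjacent frames with Lowe's ratio test.
pairs = cv2.BFMatcher(cv2.NORM_HAMMING).knnMatch(des_prev, des_cur, k=2)
good = [m for m, n in pairs if m.distance < 0.75 * n.distance]

# PROSAC-style outlier rejection on the surviving matches.
pts_prev = np.float32([kp_prev[m.queryIdx].pt for m in good])
pts_cur = np.float32([kp_cur[m.trainIdx].pt for m in good])
F, inlier_mask = cv2.findFundamentalMat(pts_prev, pts_cur, cv2.USAC_PROSAC)
```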
Step five: perform feature point matching with the GMS algorithm to remove mismatches. The specific steps of GMS mismatch elimination are:
s5.1: the image is divided into 20 x 20 non-overlapping grid cells; to handle features lying on cell boundaries, the cell width and height can be shifted and the computation iterated.
S5.2: dividing 3 x 3 pixels around each feature point into a unit, and calculating the sum S of the total matching number of each unit { i, j } and the corresponding neighbor (9 grids) of the reference frame in the neighborij
Figure BDA0002805185600000082
S5.3: calculating a threshold value forDivide correct and mismatch, where niI.e. the average number of feature points in each grid, alpha is an empirical value of 5. The calculation formula is as follows:
Figure BDA0002805185600000083
s5.4: and repeating the steps S5.2 and S5.3 until the image traversal is completed to obtain the correct matching with the reference frame.
Step six: acquire the pose of the camera through the tracking thread: estimate the pose of the current frame from the matches of step five according to a uniform (constant-velocity) motion model, track the map points of the previous frame, and determine the pose; a minimal sketch of the constant-velocity prediction follows.
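A minimal sketch of the constant-velocity prediction, assuming the 4x4 homogeneous camera poses of the two previous frames are available:

```python
import numpy as np

def predict_pose(T_prev, T_prev2):
    """Constant-velocity model: apply the last inter-frame motion once more."""
    velocity = T_prev @ np.linalg.inv(T_prev2)  # motion between the last two frames
    return velocity @ T_prev                    # predicted pose of the current frame
```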
Step seven: performing point cloud processing through a local mapping process, and obtaining a sparse point cloud map through local BA optimization;
Step eight: optimize the pose with loop detection and correct drift errors: a bag-of-words model judges whether a loop has been produced, and if so, the accumulated error is corrected through the loop to optimize the pose.
The beneficial effects of this embodiment: 1) the operating accuracy and robustness of the visual SLAM system in a highly dynamic environment are improved by combining motion consistency and semantic information to remove dynamic features.
2) aiming mainly at the feature matching problem, a grid-based motion-statistics method is provided that can quickly eliminate wrong matches and thereby improve matching stability.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (10)

1. A semantic SLAM method based on GMS feature matching in a dynamic scene is characterized by comprising the following steps:
the method comprises the following steps: calibrating a camera and removing image distortion; acquiring and inputting an environment image;
step two: segmenting the input image through a semantic segmentation network to obtain the masks of all objects, and removing the masks of the dynamic objects to obtain a preliminary image with the dynamic objects removed;
step three: extracting ORB feature points from the input image, and then computing descriptors;
step four: detecting and eliminating dynamic feature points according to a method combining motion consistency and semantic information;
step five: performing feature point matching by using a GMS algorithm to remove mismatching;
step six: obtaining the camera pose through the tracking thread;
step seven: performing point cloud processing through a local mapping process to obtain a sparse point cloud map;
step eight: and optimizing the pose by using loop detection and correcting drift errors.
2. The semantic SLAM method based on GMS feature matching in a dynamic scene as claimed in claim 1, wherein in step one, calibrating the camera and removing image distortion specifically comprises the steps of:
s1.1: first, obtain the camera intrinsic parameters fx, fy, cx, cy, and normalize the three-dimensional coordinates (X, Y, Z) to the normalized plane coordinates (x, y), with x = X/Z and y = Y/Z;
s1.2: remove the effect of lens distortion on the image, where [k1, k2, k3, p1, p2] are the lens distortion coefficients and r is the distance of the point from the origin of the normalized coordinate system:

x' = x(1 + k1 r^2 + k2 r^4 + k3 r^6) + 2 p1 x y + p2 (r^2 + 2 x^2)
y' = y(1 + k1 r^2 + k2 r^4 + k3 r^6) + p1 (r^2 + 2 y^2) + 2 p2 x y
s1.3: transfer the coordinates from the camera coordinate system to the pixel coordinate system:

u = fx x' + cx, v = fy y' + cy
3. the semantic SLAM method based on GMS feature matching in a dynamic scene according to claim 1, wherein the semantic segmentation network is a lightweight semantic segmentation network FcHarDnet.
4. The semantic SLAM method based on GMS feature matching in the dynamic scene as claimed in claim 1, wherein in step three, a Gaussian pyramid is constructed in the process of extracting ORB feature points from the input image and feature point detection is performed on different layers of the pyramid, so as to achieve scale invariance.
5. The semantic SLAM method based on GMS feature matching in a dynamic scene, according to claim 4, wherein the specific steps of ORB feature point extraction are as follows:
s3.1: when the difference between the gray value I(x) of more than N points on a circle around a point P and the gray value I(P) of P is greater than the threshold ε, the point is considered a target corner point, specifically expressed as:

N = Σ_{x∈circle(P)} 1( |I(x) − I(P)| > ε )
s3.2: to make the feature points rotation-invariant, the intensity centroid is calculated:

m_pq = Σ_{x,y} x^p y^q I(x, y), C = ( m10/m00 , m01/m00 )

where m00, m10, m01 are the gray-scale moments of the circular area around the key point;
s3.3: the image pyramid is constructed by scaling the original image sequence by a fixed ratio; corresponding corner points are extracted from the images of different sizes at each level of the image pyramid;
s3.4: a quadtree homogenization algorithm is adopted to repeatedly subdivide the image into four equal-sized child nodes; nodes containing no feature points are removed, and nodes containing more than one feature point are split further; after node allocation is finished, redundant feature points in the child nodes are deleted.
6. The semantic SLAM method based on GMS feature matching in a dynamic scenario as claimed in claim 5, wherein in step four, motion consistency detection is as follows: a static point satisfies the epipolar geometric constraint, i.e., the matching point of a static feature re-projected onto the reference frame must lie on the intersection line of the reference frame image plane and the epipolar plane (the epipolar line); the specific steps are:
S4.1: p1, p2 are the homogeneous coordinates of the matching points in the current frame and the reference frame, where u, v are pixel coordinates in the image, specifically:

p1 = [u1, v1, 1], p2 = [u2, v2, 1]
s4.2: compute the epipolar line, where F is the fundamental matrix, obtainable from eight pairs of feature points, specifically:

[X, Y, Z]^T = F p1^T

where [X, Y, Z]^T represents the epipolar line vector;
s4.3: compute the distance D from the matching point to the corresponding epipolar line; for a static point D approaches 0, and when D is greater than a threshold ε the feature point is dynamic, specifically:

D = |p2 F p1^T| / sqrt(X^2 + Y^2)
the dynamic point determination method comprises the following steps: if there are a sufficient number of dynamic points detected by motion consistency within the mask of the object semantic segmentation, all points of the object are considered dynamic, and all points within the whole object mask are eliminated.
7. The semantic SLAM method based on GMS feature matching in a dynamic scenario as claimed in claim 6, wherein in step five, the specific steps of GMS mismatch elimination are:
s5.1: dividing the image into 20 x 20 non-overlapping grid cells, and shifting the cell width and height for iterative computation in order to handle features lying on cell boundaries;
s5.2: taking the 3 x 3 grid cells around each feature point as a unit, and computing for each cell pair {i, j} the matching support S_ij as the sum of matches over the unit and the corresponding neighborhood in the reference frame:

S_ij = Σ_{k=1}^{9} |X_{i_k j_k}|

where |X_{i_k j_k}| represents the number of matched feature points on the corresponding grid-cell pair;
s5.3: calculating the threshold that divides each cell into correct matches and mismatches, where n_i is the average number of feature points in each grid cell and α is an empirical value; the calculation formula is:

τ_i = α · sqrt(n_i)
s5.4: and repeating the steps S5.2 and S5.3 until the image traversal is completed to obtain the correct matching with the reference frame.
8. The semantic SLAM method based on GMS feature matching in a dynamic scenario as claimed in claim 1, wherein in step six, the tracking thread specifically: estimates the pose of the current frame from the matches of step five according to a uniform motion model, tracks the map points of the previous frame, and determines the pose.
9. The semantic SLAM method based on GMS feature matching in a dynamic scene as claimed in claim 1, wherein the point cloud processing is performed by a local mapping process, and a sparse point cloud map is obtained by local BA optimization.
10. The semantic SLAM method based on GMS feature matching in a dynamic scene as claimed in claim 1, wherein in the eighth step, a bag of words model is used to determine whether a loop is generated, and if so, the accumulated error is corrected by the loop to optimize the pose.
CN202011365138.3A 2020-11-27 2020-11-27 Semantic SLAM method based on GMS feature matching in dynamic scene Pending CN112381841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011365138.3A CN112381841A (en) 2020-11-27 2020-11-27 Semantic SLAM method based on GMS feature matching in dynamic scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011365138.3A CN112381841A (en) 2020-11-27 2020-11-27 Semantic SLAM method based on GMS feature matching in dynamic scene

Publications (1)

Publication Number Publication Date
CN112381841A true CN112381841A (en) 2021-02-19

Family

ID=74588637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011365138.3A Pending CN112381841A (en) 2020-11-27 2020-11-27 Semantic SLAM method based on GMS feature matching in dynamic scene

Country Status (1)

Country Link
CN (1) CN112381841A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012197A (en) * 2021-03-19 2021-06-22 华南理工大学 Binocular vision odometer positioning method suitable for dynamic traffic scene
CN113159043A (en) * 2021-04-01 2021-07-23 北京大学 Feature point matching method and system based on semantic information
CN113378746A (en) * 2021-06-22 2021-09-10 中国科学技术大学 Positioning method and device
CN113673524A (en) * 2021-07-05 2021-11-19 北京物资学院 Method and device for removing dynamic characteristic points of warehouse semi-structured environment
CN113743413A (en) * 2021-07-30 2021-12-03 的卢技术有限公司 Visual SLAM method and system combining image semantic information
CN113740871A (en) * 2021-07-30 2021-12-03 西安交通大学 Laser SLAM method, system equipment and storage medium in high dynamic environment
CN115388880A (en) * 2022-10-27 2022-11-25 联友智连科技有限公司 Low-cost memory parking map building and positioning method and device and electronic equipment
CN116339336A (en) * 2023-03-29 2023-06-27 北京信息科技大学 Electric agricultural machinery cluster collaborative operation method, device and system
CN117274620A (en) * 2023-11-23 2023-12-22 东华理工大学南昌校区 Visual SLAM method based on self-adaptive uniform division feature point extraction

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596974A (en) * 2018-04-04 2018-09-28 清华大学 Dynamic scene robot localization builds drawing system and method
CN109631855A (en) * 2019-01-25 2019-04-16 西安电子科技大学 High-precision vehicle positioning method based on ORB-SLAM
CN110675437A (en) * 2019-09-24 2020-01-10 重庆邮电大学 Image matching method based on improved GMS-ORB characteristics and storage medium
CN110738667A (en) * 2019-09-25 2020-01-31 北京影谱科技股份有限公司 RGB-D SLAM method and system based on dynamic scene
CN111161318A (en) * 2019-12-30 2020-05-15 广东工业大学 Dynamic scene SLAM method based on YOLO algorithm and GMS feature matching
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method
CN111724439A (en) * 2019-11-29 2020-09-29 中国科学院上海微***与信息技术研究所 Visual positioning method and device in dynamic scene
CN111784576A (en) * 2020-06-11 2020-10-16 长安大学 Image splicing method based on improved ORB feature algorithm
CN111797688A (en) * 2020-06-02 2020-10-20 武汉大学 Visual SLAM method based on optical flow and semantic segmentation

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596974A (en) * 2018-04-04 2018-09-28 清华大学 Dynamic scene robot localization builds drawing system and method
CN109631855A (en) * 2019-01-25 2019-04-16 西安电子科技大学 High-precision vehicle positioning method based on ORB-SLAM
CN110675437A (en) * 2019-09-24 2020-01-10 重庆邮电大学 Image matching method based on improved GMS-ORB characteristics and storage medium
CN110738667A (en) * 2019-09-25 2020-01-31 北京影谱科技股份有限公司 RGB-D SLAM method and system based on dynamic scene
CN111724439A (en) * 2019-11-29 2020-09-29 中国科学院上海微***与信息技术研究所 Visual positioning method and device in dynamic scene
CN111161318A (en) * 2019-12-30 2020-05-15 广东工业大学 Dynamic scene SLAM method based on YOLO algorithm and GMS feature matching
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method
CN111797688A (en) * 2020-06-02 2020-10-20 武汉大学 Visual SLAM method based on optical flow and semantic segmentation
CN111784576A (en) * 2020-06-11 2020-10-16 长安大学 Image splicing method based on improved ORB feature algorithm

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012197A (en) * 2021-03-19 2021-06-22 华南理工大学 Binocular vision odometer positioning method suitable for dynamic traffic scene
CN113159043A (en) * 2021-04-01 2021-07-23 北京大学 Feature point matching method and system based on semantic information
CN113159043B (en) * 2021-04-01 2024-04-30 北京大学 Feature point matching method and system based on semantic information
CN113378746B (en) * 2021-06-22 2022-09-02 中国科学技术大学 Positioning method and device
CN113378746A (en) * 2021-06-22 2021-09-10 中国科学技术大学 Positioning method and device
CN113673524A (en) * 2021-07-05 2021-11-19 北京物资学院 Method and device for removing dynamic characteristic points of warehouse semi-structured environment
CN113740871A (en) * 2021-07-30 2021-12-03 西安交通大学 Laser SLAM method, system equipment and storage medium in high dynamic environment
CN113743413B (en) * 2021-07-30 2023-12-01 的卢技术有限公司 Visual SLAM method and system combining image semantic information
CN113740871B (en) * 2021-07-30 2024-04-02 西安交通大学 Laser SLAM method, system equipment and storage medium under high dynamic environment
CN113743413A (en) * 2021-07-30 2021-12-03 的卢技术有限公司 Visual SLAM method and system combining image semantic information
CN115388880A (en) * 2022-10-27 2022-11-25 联友智连科技有限公司 Low-cost memory parking map building and positioning method and device and electronic equipment
CN116339336A (en) * 2023-03-29 2023-06-27 北京信息科技大学 Electric agricultural machinery cluster collaborative operation method, device and system
CN117274620A (en) * 2023-11-23 2023-12-22 东华理工大学南昌校区 Visual SLAM method based on self-adaptive uniform division feature point extraction
CN117274620B (en) * 2023-11-23 2024-02-06 东华理工大学南昌校区 Visual SLAM method based on self-adaptive uniform division feature point extraction

Similar Documents

Publication Publication Date Title
CN112381841A (en) Semantic SLAM method based on GMS feature matching in dynamic scene
CN112396595B (en) Semantic SLAM method based on point-line characteristics in dynamic environment
CN111462200B (en) Cross-video pedestrian positioning and tracking method, system and equipment
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
CN112132893B (en) Visual SLAM method suitable for indoor dynamic environment
CN112991447B (en) Visual positioning and static map construction method and system in dynamic environment
CN112734852B (en) Robot mapping method and device and computing equipment
CN112435262A (en) Dynamic environment information detection method based on semantic segmentation network and multi-view geometry
CN108776989B (en) Low-texture planar scene reconstruction method based on sparse SLAM framework
CN110599522B (en) Method for detecting and removing dynamic target in video sequence
CN111998862B (en) BNN-based dense binocular SLAM method
CN113744315B (en) Semi-direct vision odometer based on binocular vision
CN112446882A (en) Robust visual SLAM method based on deep learning in dynamic scene
CN112652020B (en) Visual SLAM method based on AdaLAM algorithm
CN112802096A (en) Device and method for realizing real-time positioning and mapping
CN111105460A (en) RGB-D camera pose estimation method for indoor scene three-dimensional reconstruction
CN112541423A (en) Synchronous positioning and map construction method and system
CN116128966A (en) Semantic positioning method based on environmental object
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
CN113487631A (en) Adjustable large-angle detection sensing and control method based on LEGO-LOAM
CN111950599A (en) Dense visual odometer method for fusing edge information in dynamic environment
CN112446885A (en) SLAM method based on improved semantic optical flow method in dynamic environment
CN110570473A (en) weight self-adaptive posture estimation method based on point-line fusion
CN116309817A (en) Tray detection and positioning method based on RGB-D camera
CN111915632B (en) Machine learning-based method for constructing truth database of lean texture target object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination