CN117710806A - Semantic visual SLAM method and system based on semantic segmentation and optical flow

Semantic visual SLAM method and system based on semantic segmentation and optical flow

Info

Publication number
CN117710806A
CN117710806A (application CN202311411809.9A)
Authority
CN
China
Prior art keywords
semantic
dynamic
semantic segmentation
key frame
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311411809.9A
Other languages
Chinese (zh)
Inventor
李一鸣
王逸泽
陆刘炜
郭一冉
黄民
周俊莹
邵晨曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Machinery Productivity Promotion Center Co ltd
Beijing Information Science and Technology University
Original Assignee
China Machinery Productivity Promotion Center Co ltd
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Machinery Productivity Promotion Center Co ltd and Beijing Information Science and Technology University
Priority to CN202311411809.9A
Publication of CN117710806A

Classifications

    • G06V20/10 Terrestrial scenes (scenes; scene-specific elements)
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Neural network learning methods
    • G06T3/4038 Image mosaicing (scaling of whole images or parts thereof)
    • G06T7/10 Segmentation; edge detection (image analysis)
    • G06T7/246 Analysis of motion using feature-based methods, e.g. tracking of corners or segments
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06V10/26 Segmentation of patterns in the image field; detection of occlusion
    • G06V10/44 Local feature extraction, e.g. edges, contours, corners; connectivity analysis
    • G06V10/75 Organisation of image or video matching processes; proximity measures in feature spaces
    • G06V10/764 Recognition using machine-learning classification, e.g. of video objects
    • G06V10/82 Recognition using neural networks
    • G06T2200/32 Indexing scheme involving image mosaicing
    • G06T2207/10024 Image acquisition modality: color image
    • G06T2207/10028 Image acquisition modality: range image; depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic visual SLAM method and system based on semantic segmentation and optical flow. A semantic segmentation network removes a priori dynamic targets from the RGB images of a dynamic scene, and an optical flow method removes non-a-priori dynamic targets from the RGB images, so that feature points on non-a-priori dynamic targets and on dynamic-target edges are effectively eliminated. The remaining static-target feature points are used for matching and pose estimation, the tracking thread generates key frames, and background restoration is performed on the parts of the key frames occluded by dynamic targets, improving the accuracy of pose estimation. The background-restored images are combined with the semantic tag information of static objects to obtain a local map containing semantic tag information; point clouds are then stitched using the pose information to construct a dense global map containing semantic tag information. This solves the problems that a visual SLAM system is easily affected by dynamic targets, leading to large pose-estimation errors, poor real-time performance, and inability to build a semantic map.

Description

Semantic visual SLAM method and system based on semantic segmentation and optical flow
Technical Field
The invention relates to the technical field of visual SLAM localization and mapping, and in particular to a semantic visual SLAM method and system based on semantic segmentation and optical flow.
Background
Visual simultaneous localization and mapping (SLAM) means that a mobile robot, in an environment without prior knowledge, obtains external environment information consisting mainly of images through a camera, and performs pose estimation and environment-map construction during motion. Currently, most visual SLAM methods assume that the external environment is a static scene and that scene change is caused mainly by camera motion. In a real environment, however, there are inevitably moving objects, such as walking people and running vehicles. If a dynamic target occupies a large area of the field of view, feature-point mismatching and feature-tracking failure can occur, causing excessive drift and localization failure. SLAM methods originally designed to run in static scenes therefore cannot handle complex dynamic scenes.
To solve this problem, dynamic targets must be identified and rejected in the dynamic environment. In the visual SLAM field, hierarchical image-feature extraction methods typified by deep learning have appeared in recent years; semantic segmentation or object detection algorithms are mostly used to identify dynamic objects in the environment and have been successfully applied to SLAM inter-frame estimation. In the prior art, however, systems combining deep learning either cannot effectively identify dynamic targets that lack prior knowledge with a semantic segmentation or object detection network, or rely on network models whose parameter counts and compute demands are too high for the hardware, making the system difficult to deploy on mobile devices and degrading its real-time performance.
Disclosure of Invention
In view of the above, the invention provides a semantic visual SLAM method and system based on semantic segmentation and optical flow, which can effectively identify both a priori and non-a-priori dynamic targets, improving localization accuracy in dynamic scenes and enabling the construction of semantic maps.
The invention adopts the following specific technical scheme:
a semantic visual SLAM method based on semantic segmentation and optical flow, comprising: removing prior dynamic targets in RGB images of a dynamic scene by adopting a semantic segmentation network, removing non-prior dynamic targets in the RGB images by adopting an optical flow method, and obtaining static feature points; performing feature point matching and pose estimation on the static feature points, repositioning the pose of the camera through repositioning threads, determining a key frame of the dynamic scene, and performing background restoration on a part of the key frame, which is shielded by a dynamic target; and constructing a dense global map containing semantic tag information according to the semantic tag information of the static feature points and the key frame for completing background restoration.
Further, before removing the a priori dynamic targets in the RGB images of the dynamic scene with the semantic segmentation network, the method further includes: acquiring RGB images and depth images of the dynamic scene, and inputting the RGB images simultaneously into the tracking thread of an ORB-SLAM2 framework and a newly added semantic segmentation thread, wherein the semantic segmentation thread implements the semantic segmentation network, the optical flow method, and the background restoration.
Further, removing the a priori dynamic targets in the RGB images of the dynamic scene with the semantic segmentation network includes: acquiring pixel-level semantic tag information of the RGB images with the semantic segmentation network, and removing the a priori dynamic targets according to the semantic tag information.
Further, the semantic segmentation network adopts the lightweight network MobileNetV2 as its backbone network.
Further, removing the non-a-priori dynamic targets in the RGB images with the optical flow method includes: obtaining the average motion speed of the feature points of the RGB image, judging feature points exceeding a preset speed threshold to be dynamic feature points, and removing those dynamic feature points to eliminate the non-a-priori dynamic targets in the RGB image.
Further, constructing a dense global map containing semantic tag information from the semantic tag information of the static feature points and the background-restored key frames comprises: obtaining a local map containing semantic information from the semantic tag information of the static feature points and the background-restored key frames; and performing point-cloud stitching on the local map according to the pose information obtained by the pose estimation, to obtain a dense global map containing semantic tag information.
Further, the tracking thread of the ORB-SLAM2 framework relocates the camera pose through the relocalization thread to determine the key frames of the dynamic scene.
Further, performing background restoration on the part of a key frame occluded by a dynamic target comprises: selecting the n key frames preceding the current key frame, setting association weights for the current key frame and the n key frames, and projecting the image color information and image depth information of the n key frames onto the current key frame according to the association weights.
A dynamic scene image construction system based on semantic segmentation and optical flow, comprising an ORB-SLAM2 framework structure and a semantic segmentation thread, wherein the semantic segmentation thread is a concurrent thread of the tracking thread of the ORB-SLAM2 framework structure; the semantic segmentation thread comprises a semantic segmentation module, an optical flow calculation module, and a background restoration module. The semantic segmentation module removes a priori dynamic targets in RGB images of the dynamic scene with a semantic segmentation network; the optical flow calculation module removes non-a-priori dynamic targets in the RGB images with an optical flow method to obtain static feature points; the background restoration module performs background restoration on the parts of key frames occluded by dynamic targets. The tracking thread of the ORB-SLAM2 framework structure performs feature-point matching and pose estimation on the static feature points, relocates the camera pose through the relocalization thread, determines the key frames of the dynamic scene, and constructs a dense global map containing semantic tag information from the semantic tag information of the static feature points and the background-restored key frames.
Further, in the semantic segmentation module, the semantic segmentation network adopts the lightweight network MobileNetV2 as its backbone network.
The beneficial effects are that:
(1) The semantic visual SLAM method based on semantic segmentation and optical flow removes a priori dynamic targets in the RGB images of a dynamic scene with a semantic segmentation network and removes non-a-priori dynamic targets with an optical flow method, so that feature points on non-a-priori dynamic targets and on dynamic-target edges are effectively removed; the static-target feature points obtained are used for matching and pose estimation, and the tracking thread generates key frames, improving the accuracy of pose estimation.
(2) The semantic segmentation network adopts the lightweight network MobileNetV2 as its backbone, which keeps the feature-extraction model structure light, improves processing speed, and ensures the real-time performance of the SLAM system.
(3) The n key frames preceding the current key frame are selected, and association weights are set for the current key frame and the n key frames; background restoration of the RGB-D images of the dynamic scene according to these weights effectively restores the background occluded by dynamic targets, provides more accurate matching information for the relocalization stage of the ORB-SLAM2 framework, and further improves localization accuracy.
Drawings
FIG. 1 is a flow chart of a semantic vision SLAM method and system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a semantic vision SLAM method and system according to an embodiment of the present invention.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
To solve the problems that the above visual SLAM systems are susceptible to dynamic targets, which leads to large pose-estimation errors and poor real-time performance, an embodiment of the present invention provides a semantic visual SLAM method and system based on semantic segmentation and optical flow. FIG. 1 is a flowchart of the semantic visual SLAM method according to an embodiment of the present invention; as shown in FIG. 1, the method includes:
Step S101: removing a priori dynamic targets in the RGB images of a dynamic scene with a semantic segmentation network, and removing non-a-priori dynamic targets in the RGB images with an optical flow method, to obtain static feature points;
In one exemplary embodiment, removing the a priori dynamic targets in the RGB images of a dynamic scene with the semantic segmentation network comprises: acquiring pixel-level semantic tag information of the RGB images with the semantic segmentation network, and removing the a priori dynamic targets according to the semantic tag information.
In one exemplary embodiment, the semantic segmentation network employs a lightweight network MobileNetV2 as the backbone network.
In one exemplary embodiment, removing the non-a-priori dynamic targets in the RGB images with the optical flow method comprises: obtaining the average motion speed of the feature points of the RGB image, judging feature points exceeding a preset speed threshold to be dynamic feature points, and removing those dynamic feature points to eliminate the non-a-priori dynamic targets in the RGB image.
In one exemplary embodiment, before removing the a priori dynamic targets in the RGB images of the dynamic scene with the semantic segmentation network, the method further comprises: acquiring RGB images and depth images of the dynamic scene, and inputting the RGB images simultaneously into the tracking thread of the ORB-SLAM2 framework and a newly added semantic segmentation thread, wherein the semantic segmentation thread implements the semantic segmentation network, the optical flow method, and the background restoration.
In an actual implementation, image data, including RGB images and depth images, may be acquired by a binocular RGB-D camera.
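As a concrete illustration, the sketch below reads one RGB frame and one depth frame from an Intel RealSense camera. The patent does not name a specific device, so pyrealsense2 and the 640x480 at 30 fps stream settings are assumptions made for the example.

```python
# A minimal sketch of RGB-D acquisition with an assumed RealSense device.
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()
    rgb = np.asanyarray(frames.get_color_frame().get_data())     # HxWx3 uint8
    depth = np.asanyarray(frames.get_depth_frame().get_data())   # HxW uint16 depth units
finally:
    pipeline.stop()
```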
Step S102: performing feature-point matching and pose estimation on the static feature points, relocating the camera pose through the relocalization thread, determining key frames of the dynamic scene, and performing background restoration on the parts of the key frames occluded by dynamic targets;
in the actual implementation process, feature point matching is carried out on the static feature points to obtain more accurate pose information of the object in the dynamic scene.
In one exemplary embodiment, the tracking thread of the ORB-SLAM2 framework relocates the camera pose through the relocalization thread to determine the key frames of the dynamic scene.
In the actual implementation process, the frames which greatly contribute to the subsequent scene reconstruction are selected as key frames of the dynamic scene.
Step S103, constructing a dense global map containing semantic tag information according to the semantic tag information of the static feature points and the key frame for completing background restoration.
In one exemplary embodiment, constructing a dense global map containing semantic tag information from the semantic tag information of the static feature points and the background-restored key frames comprises: obtaining a local map containing semantic information from the semantic tag information of the static feature points and the background-restored key frames; and performing point-cloud stitching on the local map according to the pose information obtained by pose estimation, to obtain a dense global map containing semantic tag information.
In the actual implementation process, the n key-frame images preceding the current frame to be restored are selected and association weights are set, so that key frames at different times carry different weights, with key frames closer to the current frame weighted more heavily; the color information and depth information of those images are projected onto the current frame according to the feature-point matching relation. The background-restored image is combined with the semantic tag information of static objects to obtain a local map containing semantic information; point clouds are then stitched using pose information to construct a dense global map containing semantic information. Here n is a positive integer. A sketch of the map-building step follows.
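The sketch below illustrates the point-cloud stitching under stated assumptions: each background-restored key frame (RGB, metric depth, per-pixel semantic tags) is back-projected through pinhole intrinsics K, transformed by its estimated 4x4 camera-to-world pose, and concatenated into a global labeled point cloud. The intrinsics and the pose format are assumptions for the example, not values from the patent.

```python
import numpy as np

def frame_to_cloud(rgb, depth_m, labels, K):
    """Back-project one frame; depth_m is metric depth, labels are per-pixel class ids."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m.ravel()
    valid = z > 0                                   # skip pixels with no depth
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)[valid]
    return pts, rgb.reshape(-1, 3)[valid], labels.ravel()[valid]

def stitch_global_map(keyframes, poses, K):
    """keyframes: list of (rgb, depth_m, labels); poses: 4x4 camera-to-world matrices."""
    clouds = []
    for (rgb, depth_m, labels), T in zip(keyframes, poses):
        pts, cols, labs = frame_to_cloud(rgb, depth_m, labels, K)
        pts_world = pts @ T[:3, :3].T + T[:3, 3]    # transform into the world frame
        clouds.append((pts_world, cols, labs))
    # Concatenate points, colors, and semantic tags into one dense global cloud.
    return tuple(np.concatenate(c, axis=0) for c in zip(*clouds))
```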
The embodiment of the invention also provides a semantic visual SLAM system based on semantic segmentation and optical flow, comprising an ORB-SLAM2 framework structure and an added semantic segmentation thread, wherein the semantic segmentation thread is a concurrent thread of the tracking thread of the ORB-SLAM2 framework structure; the semantic segmentation thread comprises a semantic segmentation module, an optical flow calculation module, and a background restoration module. The semantic segmentation module removes a priori dynamic targets in RGB images of the dynamic scene with a semantic segmentation network; the optical flow calculation module removes non-a-priori dynamic targets in the RGB images with an optical flow method to obtain static feature points; the background restoration module performs background restoration on the parts of key frames occluded by dynamic targets. The tracking thread of the ORB-SLAM2 framework structure performs feature-point matching and pose estimation on the static feature points, relocates the camera pose through the relocalization thread, determines key frames of the dynamic scene, and constructs a dense global map containing semantic tag information from the semantic tag information of the static feature points and the background-restored key frames.
In one exemplary embodiment, in the semantic segmentation module, the semantic segmentation network employs a lightweight network MobileNetV2 as a backbone network.
In order to enable those skilled in the art to better understand the technical solutions of the present invention, the following description is provided with reference to specific exemplary embodiments.
Scene embodiment one
This scene embodiment adopts the technical scheme of adding a semantic segmentation thread on the basis of the ORB-SLAM2 framework (ORB: Oriented FAST and Rotated BRIEF; SLAM: Simultaneous Localization and Mapping), realizing a semantic visual SLAM system for dynamic scenes and effectively eliminating the influence of dynamic targets on a traditional SLAM system. The thread comprises a semantic segmentation module, an optical flow calculation module, and a background restoration module. The semantic segmentation module uses a lightweight network as the backbone feature-extraction network of the semantic segmentation network to extract dynamic-target features, keeping the feature-extraction model light and improving prediction speed, thereby ensuring real-time operation of the system.
FIG. 2 is a schematic diagram of the dynamic scene image construction system according to an embodiment of the invention. As shown in FIG. 2, the system runs four threads in parallel: the tracking thread, the semantic segmentation thread, the loop-closure detection thread, and the local mapping thread. The semantic segmentation thread comprises a semantic segmentation module, an optical flow calculation module, and a background restoration module.
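A conceptual sketch of this four-thread layout follows, using Python threads and queues as a stand-in for the C++ threads of ORB-SLAM2; the function bodies are stubs, and the names and queue wiring are assumptions made for illustration.

```python
import queue
import threading

frame_q = queue.Queue()      # camera frames -> semantic segmentation thread
seg_q = queue.Queue()        # frames plus masks -> tracking thread
keyframe_q = queue.Queue()   # key frames -> local mapping / loop closing

def segment_and_flow(frame):
    """Stub: semantic segmentation + optical flow culling + background repair."""
    return {"frame": frame, "static_mask": None}

def track(item):
    """Stub: match static feature points, estimate pose, decide on a key frame."""
    return {"pose": None, "keyframe": item}

def segmentation_worker():
    while True:
        seg_q.put(segment_and_flow(frame_q.get()))

def tracking_worker():
    while True:
        result = track(seg_q.get())
        if result["keyframe"] is not None:
            keyframe_q.put(result["keyframe"])

for worker in (segmentation_worker, tracking_worker):
    threading.Thread(target=worker, daemon=True).start()
frame_q.put("rgb frame 0")   # the camera loop would feed frames here
```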
First, in the ORB-SLAM2 framework, the tracking thread (Tracking), also known as the visual odometry front end, mainly performs feature extraction, feature matching, motion estimation, pose optimization, state update, and key-frame selection. ORB feature points are first extracted from the current frame, corresponding feature points are searched for in the previous frame, and a descriptor-matching algorithm matches them with the feature points in the current frame. Second, the camera motion is estimated from the correspondences between adjacent frames, and the camera pose is optimized with the BA algorithm to minimize the reprojection error. The estimated camera pose is then used to update the system state. ORB-SLAM2 manages system states, including initialization, tracking, loss, and relocalization, through a state machine. If ORB-SLAM2 loses track on the current frame, the system attempts to recover the camera pose through the relocalization procedure. Finally, frames that contribute most to the subsequent scene reconstruction are selected as key frames, reducing the amount of computation and improving system efficiency.
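To make the pose-optimization step concrete, the following toy sketch minimizes the reprojection error of synthetic map points over a 6-DoF pose with nonlinear least squares. The intrinsics and the synthetic data are assumptions; full BA jointly optimizes poses and map points with robust kernels, which is omitted here for brevity.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(pose6, pts3d, K):
    """Project 3D points with a pose given as [rx, ry, rz, tx, ty, tz]."""
    R = Rotation.from_rotvec(pose6[:3]).as_matrix()
    cam = pts3d @ R.T + pose6[3:]          # world -> camera coordinates
    uv = cam @ K.T                         # camera -> homogeneous pixels
    return uv[:, :2] / uv[:, 2:3]

def residuals(pose6, pts3d, obs2d, K):
    return (project(pose6, pts3d, K) - obs2d).ravel()

K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1.0]])
pts3d = np.random.rand(50, 3) * 2 + np.array([0, 0, 4])        # synthetic map points
true_pose = np.array([0.02, -0.01, 0.0, 0.1, 0.0, 0.05])
obs2d = project(true_pose, pts3d, K) + np.random.randn(50, 2) * 0.5  # noisy observations

sol = least_squares(residuals, x0=np.zeros(6), args=(pts3d, obs2d, K))
print("estimated pose:", sol.x)            # close to true_pose
```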
The local mapping thread (Local Mapping) builds a local map containing the feature points and map points observed by the camera. It runs as a separate thread that continually receives new camera frames and updated camera poses from the tracking thread and uses this data to construct and update the local map. Its main work comprises triangulation, map-point screening, map updating, and key-frame selection. When the tracking thread receives a new camera frame, the local mapping thread triangulates the feature points into 3D map points using the camera pose. To increase map accuracy, ORB-SLAM2 uses a multi-view geometry algorithm to triangulate map points and bundle adjustment (BA) to optimize their positions. A greedy map-point screening strategy then quickly and effectively selects the most representative map points to improve the quality and efficiency of the local map; it typically prefers map points close to the current camera frame and of higher quality among the points in the field of view. An incremental map-update algorithm then updates the local map from the camera pose and the triangulation result each time a new camera frame is received; updating includes adding new map points, matching them with existing map points, and updating the positions and descriptors of existing points. The local mapping thread is also responsible for selecting key frames to facilitate global map optimization and loop detection by ORB-SLAM2.
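A minimal sketch of the triangulation step follows: matched pixel observations in two key frames, together with their projection matrices, are converted into 3D map points. The intrinsics, relative pose, and pixel matches are illustrative assumptions.

```python
import numpy as np
import cv2

K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1.0]])
# Projection matrices P = K [R | t] for two key frames (frame 1 at the origin).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
R2, t2 = np.eye(3), np.array([[0.1], [0.0], [0.0]])   # small baseline to the right
P2 = K @ np.hstack([R2, t2])

pts1 = np.array([[300.0, 250.0], [340.0, 220.0]]).T   # 2xN pixels in key frame 1
pts2 = np.array([[290.0, 250.0], [331.0, 220.0]]).T   # matched pixels in key frame 2

hom = cv2.triangulatePoints(P1, P2, pts1, pts2)       # 4xN homogeneous points
map_points = (hom[:3] / hom[3]).T                     # Nx3 Euclidean map points
print(map_points)
```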
The loop-closure detection thread (Loop Closing) detects whether a loop exists in the camera path and processes it. When a loop exists, the thread attempts to match the loop-detected key frame with earlier key frames to correct the accumulated error of the camera path and improve map accuracy. Its main work comprises loop detection and loop matching, and it also serves the relocalization calls of the tracking thread. ORB-SLAM2 uses a bag-of-words (BoW) model to detect loops: each key frame is represented as a bag of words composed of several visual words, and two key frames with similar bags of words are considered different views of the same place. The bag of words of the current frame is compared with those of historical frames to detect a loop. When a loop is detected, the match between the current frame and the historical frame is optimized to correct the camera-path error: the RANSAC algorithm estimates the transformation between the two key frames, and the whole loop is then optimized with the BA algorithm. The loop-closure thread is also used for relocalization when ORB-SLAM2 cannot track the camera, for example when the camera moves into a new, unexplored area; the feature points of the current frame are used to search historical key frames for the best-matching key frame with which to recover the camera pose.
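The following conceptual sketch mirrors the BoW comparison described above with plain word histograms. ORB-SLAM2 itself uses a DBoW2 vocabulary tree, so the vocabulary size, the L1 scoring function, and the 0.8 threshold here are assumptions for illustration.

```python
import numpy as np

VOCAB_SIZE = 1000

def bow_vector(word_ids):
    """Histogram over visual-word ids (one id per ORB descriptor), L1-normalized."""
    hist = np.bincount(word_ids, minlength=VOCAB_SIZE).astype(float)
    return hist / max(hist.sum(), 1.0)

def similarity(v1, v2):
    # L1 similarity in [0, 1], a common score for comparing BoW vectors.
    return 1.0 - 0.5 * np.abs(v1 - v2).sum()

def detect_loop(current_words, history_words, threshold=0.8):
    """Return (index, score) of the best historical key frame above threshold."""
    v = bow_vector(current_words)
    scores = [(i, similarity(v, bow_vector(w))) for i, w in enumerate(history_words)]
    best = max(scores, key=lambda s: s[1], default=None)
    return best if best is not None and best[1] > threshold else None
```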
According to the framework principle of the semantic visual SLAM system shown in FIG. 2, the method for constructing the dynamic scene image in the embodiment of the invention comprises the following steps:
step 1: obtaining image data, including an RGB image and a depth image, by a binocular RGB-D camera;
In this embodiment, a color image and a depth image are acquired through a binocular RGB-D camera, providing image frames for the tracking thread and the semantic segmentation thread. In practical implementations, image acquisition is not limited to this approach.
Step 2: transmitting the RGB image obtained in the step 1 into a tracking thread and a semantic segmentation thread;
the semantic segmentation thread comprises a semantic segmentation module, an optical flow calculation module and a background restoration module.
Step 3: removing the a priori dynamic targets in the RGB image through the improved semantic segmentation network;
the method comprises the steps of processing an image through an improved semantic segmentation network, wherein the semantic segmentation network takes a lightweight network as a main network, RGB images are processed through the improved semantic segmentation network to obtain pixel-level semantic information, and feature points on a priori dynamic target in the image are removed by utilizing the semantic information.
In order to ensure the real-time performance and the segmentation effect of the system, the embodiment adopts a lightweight network as a trunk feature extraction network of the semantic segmentation network. The color image is processed by an improved semantic segmentation network to obtain pixel-level semantic information. And eliminating characteristic points on the prior dynamic target in the image by utilizing semantic information.
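A minimal sketch of this culling step, under the assumption that the segmentation network emits a per-pixel class-id label map, is:

```python
import numpy as np
import cv2

# Class ids for prior dynamic categories (e.g. person, car) are assumptions tied
# to whatever label map the segmentation network emits.
PRIOR_DYNAMIC_CLASSES = {15, 7}

def cull_prior_dynamic(keypoints, label_map, dilate_px=5):
    """keypoints: list of cv2.KeyPoint; label_map: HxW array of class ids."""
    mask = np.isin(label_map, list(PRIOR_DYNAMIC_CLASSES)).astype(np.uint8)
    # Dilating the mask also rejects points near the target boundary; points that
    # still survive (edges, non-a-priori targets) are handled by the optical flow
    # screening described later.
    mask = cv2.dilate(mask, np.ones((dilate_px, dilate_px), np.uint8))
    return [kp for kp in keypoints
            if mask[int(round(kp.pt[1])), int(round(kp.pt[0]))] == 0]
```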
First, consider a traditional semantic segmentation network such as DeepLabv3+. Its encoder body is a deep convolutional neural network (DCNN) with atrous (dilated) convolution, generally a common classification network such as ResNet or Xception; the DCNN output then enters an atrous spatial pyramid pooling (ASPP) module, which introduces multi-scale information to reduce information loss. DeepLabv3+ adopts Xception as its backbone feature-extraction network, and its decoder module further fuses shallow features with deep features, improving the segmentation accuracy of target boundaries.
The embodiment of the invention lightens DeepLabv3+ by replacing the Xception DCNN backbone with MobileNetV2. MobileNetV2 uses depthwise-separable convolution to reduce the number of model parameters, adopts an inverted residual structure that expands the channel dimension of the convolution to improve the model's feature-extraction ability, and then uses a linear bottleneck structure to avoid the damage a nonlinear activation function does to low-dimensional information, improving network performance. Replacing the backbone with the lightweight MobileNetV2 solves the problems of a large parameter count and excessive dependence on hardware resources, giving the network a smaller model and faster processing while retaining accuracy, and meeting the real-time requirement.
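The substitution can be sketched in PyTorch as follows: the torchvision MobileNetV2 feature extractor stands in for the Xception backbone, feeding a simplified ASPP-style head. The dilation rates, channel counts, and the omission of the DeepLabv3+ decoder are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class LightweightSegNet(nn.Module):
    """DeepLabv3+-style network with a MobileNetV2 backbone (simplified sketch)."""
    def __init__(self, num_classes=21):
        super().__init__()
        # MobileNetV2 feature extractor replaces the Xception DCNN backbone.
        self.backbone = mobilenet_v2(weights=None).features    # outputs 1280 channels
        # Simplified ASPP-style head: parallel dilated convolutions.
        self.aspp = nn.ModuleList([
            nn.Conv2d(1280, 256, 3, padding=r, dilation=r) for r in (1, 6, 12, 18)
        ])
        self.classifier = nn.Conv2d(256 * 4, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.backbone(x)
        feats = torch.cat([F.relu(branch(feats)) for branch in self.aspp], dim=1)
        # Upsample back to input resolution for pixel-level semantic labels.
        return F.interpolate(self.classifier(feats), size=(h, w),
                             mode="bilinear", align_corners=False)

logits = LightweightSegNet()(torch.rand(1, 3, 480, 640))
label_map = logits.argmax(dim=1)[0]    # HxW class ids, one semantic tag per pixel
```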
The above description of the lightweight network is only illustrative; in practice, different neural networks may serve as the lightweight backbone for different network structures, improving on the original model structure, which is not repeated here.
Step 4: further screening non-a-priori dynamic targets with the optical flow method and removing them;
After semantic segmentation, the dynamic targets are covered with a mask to remove their feature points, but feature points at the edges of dynamic targets and on non-a-priori dynamic targets are still extracted, so the optical flow method is used to screen out and remove these remaining dynamic feature points.
Step 4 further screens the feature points retained after step 3 and removes the feature points on non-a-priori dynamic targets with the optical flow method. All dynamic feature points in an image frame can be screened by comparison with the average motion speed: a feature point whose motion deviates from the average by more than a threshold is judged dynamic, deleted from the pose calculation, and excluded from feature matching and reprojection pose calculation.
Optical flow describes the motion of pixels between images over time. In the optical flow method, the image can be seen as a function of time: at time t, the intensity at the feature point (x, y) is written I(x, y, t).
From the grayscale-invariance assumption, $I(x + \mathrm{d}x,\, y + \mathrm{d}y,\, t + \mathrm{d}t) = I(x, y, t)$, and a first-order Taylor expansion gives the optical-flow constraint

$$I_x u + I_y v = -I_t$$

where $u = \mathrm{d}x/\mathrm{d}t$ and $v = \mathrm{d}y/\mathrm{d}t$ are the motion speeds of the feature point in the $x$ and $y$ directions, and $I_x$, $I_y$, $I_t$ are the gradients of the image intensity with respect to $x$, $y$, and $t$.
An additional constraint is added to solve for the motion speed of the feature points. Assume the pixels in the same region share the same motion: taking an 8 x 8 square region, the 64 pixels all satisfy the constraint, giving the over-determined system

$$\begin{bmatrix} I_x^{(1)} & I_y^{(1)} \\ \vdots & \vdots \\ I_x^{(64)} & I_y^{(64)} \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} I_t^{(1)} \\ \vdots \\ I_t^{(64)} \end{bmatrix}$$

which is solved for $(u, v)$ by least squares. Let $M = \{m_1, m_2, \ldots, m_n\}$ be the set of all feature points; the average motion speed of all feature points in the set is then

$$\bar{V} = \frac{1}{n} \sum_{i=1}^{n} \sqrt{u_i^2 + v_i^2}.$$
all dynamic feature points in the set M can be screened by comparing the average movement speed, and the following formula is shown:
d is a judging threshold value, if the characteristic point is larger than the threshold value, the characteristic point is judged to be a dynamic characteristic point, the characteristic point is deleted when the pose is calculated, and the characteristic matching and the re-projection pose calculation are not participated.
Step 5: performing feature point matching and pose estimation by using the screened static feature points;
step 6: generating key frames by using tracking threads;
After the semantic segmentation thread work above and the complete tracking thread, key frames are generated, and the background occluded by the dynamic-target mask is repaired on the key frames.
Step 7: performing background restoration and mapping according to the related data of the key frames;
the method comprises the steps of selecting 20 key frame images before a current frame to be repaired, setting associated weights, enabling weights occupied by key frames at different moments to be different, and repairing color information and depth information of the images by adopting a time weighted multi-frame fusion technology.
When tracking is lost because two frames share too few matching points, the relocalization stage of the tracking thread is entered, and feature matching is performed between the current frame and all candidate key frames. The repaired candidate key frames carry richer information, which benefits pose estimation as well as the subsequent loop-closure detection and mapping threads of the SLAM system (i.e., the dynamic scene image construction system).
The n key-frame images preceding the current frame to be repaired are selected and association weights are set, so that key frames at different times carry different weights, with key frames closer to the current frame weighted more heavily; the color information and depth information of those images are projected onto the current frame according to

$$F_s = F_c \oplus \sum_{i=1}^{n} w_i F_i, \qquad \sum_{i=1}^{n} w_i = 1$$

where $F_s$ is the repaired current key frame, $F_c$ is the unrepaired current key frame, $F_i$ is the key frame at time $t_i$, $w_i$ is the association weight (larger for $t_i$ closer to the current frame), and $\oplus$ denotes filling the dynamic-target-masked pixels of $F_c$ with the weighted projection. The result of the restoration is a static key frame with dynamic interference removed. After repair with the previous static key frames, the color-image and depth-image information of the background occluded by the dynamic target is restored.
In the embodiment of the invention, dynamic targets in the environment are detected and removed with the improved semantic segmentation network combined with the optical flow method, and the feature points on static targets are obtained for matching and pose estimation. Background restoration of the parts occluded by dynamic targets is performed on the key frames with the time-weighted multi-frame fusion technique, providing more accurate matching information for the relocalization stage and further improving localization accuracy.
In summary, the embodiment of the invention provides a semantic visual SLAM method and system based on semantic segmentation and optical flow that effectively eliminate the influence of dynamic targets on a traditional SLAM system by adding a semantic segmentation thread on the basis of the ORB-SLAM2 framework. Features and prior semantic information in the environment are extracted with the semantic segmentation network, and a lightweight network serves as its backbone feature-extraction network, reducing the network's parameter count and computation and ensuring real-time operation. A dynamic-target detection and rejection method based on the improved semantic segmentation network combined with the optical flow method realizes detection and rejection of dynamic targets in indoor environments: the semantic segmentation network segments a priori dynamic targets, the optical flow method detects non-a-priori dynamic targets, and finally the feature points on static targets are obtained for matching and pose estimation. To further improve localization accuracy, a background restoration technique based on time-weighted multi-frame fusion repairs the parts of key frames occluded by dynamic targets, providing more accurate matching information for the relocalization stage.
The above specific embodiments merely describe the design principle of the present invention; the shapes of the components in the description may differ, and the names are not limiting. Those skilled in the art can modify or equivalently replace the technical scheme described in the foregoing embodiments; such modifications and substitutions do not depart from the spirit and technical scope of the invention and shall all be considered to fall within the scope of the invention.

Claims (10)

1. A semantic visual SLAM method based on semantic segmentation and optical flow, comprising:
removing prior dynamic targets in RGB images of a dynamic scene by adopting a semantic segmentation network, removing non-prior dynamic targets in the RGB images by adopting an optical flow method, and obtaining static feature points;
performing feature point matching and pose estimation on the static feature points, relocating the camera pose through a relocalization thread, determining a key frame of the dynamic scene, and performing background restoration on a part of the key frame which is occluded by a dynamic target;
and constructing a dense global map containing semantic tag information according to the semantic tag information of the static feature points and the key frame for completing background restoration.
2. The method of claim 1, wherein prior to the employing the semantic segmentation network to cull a priori dynamic objects in the RGB images of the dynamic scene, the method further comprises:
and acquiring RGB images and depth images of the dynamic scene, and simultaneously inputting the RGB images into a tracking thread and a newly added semantic segmentation thread of an ORB-SLAM2 framework, wherein the semantic segmentation thread is used for realizing the semantic segmentation network, the optical flow method and the background restoration.
3. The method of claim 1, wherein the employing the semantic segmentation network to cull a priori dynamic targets in the RGB images of the dynamic scene comprises:
and acquiring pixel-level semantic tag information of the RGB image by adopting the semantic segmentation network, and eliminating the priori dynamic target according to the semantic tag information.
4. A method according to claim 3, wherein the semantic segmentation network employs a lightweight network MobileNetV2 as a backbone network.
5. The method of claim 1, wherein the removing the non-a priori dynamic objects from the RGB image using optical flow comprises:
and acquiring the average movement speed of the characteristic points of the RGB image, judging the characteristic points larger than a preset speed threshold as dynamic characteristic points, and eliminating the dynamic characteristic points to eliminate non-prior dynamic targets in the RGB image.
6. The method of claim 1, wherein constructing a dense global map containing semantic tag information from semantic tag information of the static feature points and the key frames completing background repair comprises:
obtaining a local map containing semantic information according to the semantic tag information of the static feature points and the key frame for completing background restoration;
and according to the pose information obtained by the pose estimation, performing point cloud splicing on the local map to obtain a dense global map containing semantic tag information.
7. The method of claim 1, wherein,
the tracking thread of the ORB-SLAM2 framework relocates the camera pose through the relocalization thread to determine the key frames of the dynamic scene.
8. The method of claim 1, wherein background repairing the dynamically targeted occluded portion of the keyframe comprises:
selecting the n key frames preceding a current key frame, setting association weights for the current key frame and the n key frames, and projecting image color information and image depth information of the n key frames onto the current key frame according to the association weights.
9. A dynamic scene image construction system based on semantic segmentation and optical flow, comprising:
an ORB-SLAM2 framework structure and a semantic segmentation thread, wherein the semantic segmentation thread is a concurrent thread of a tracking thread of the ORB-SLAM2 framework structure; the semantic segmentation thread comprises a semantic segmentation module, an optical flow calculation module and a background restoration module;
the semantic segmentation module is used for removing priori dynamic targets in RGB images of the dynamic scene by adopting a semantic segmentation network; the optical flow calculation module is used for removing non-prior dynamic targets in the RGB image by adopting an optical flow method to obtain static characteristic points; the background restoration module is used for restoring the background of the part shielded by the dynamic target on the key frame;
the tracking thread of the ORB-SLAM2 framework structure is used for performing feature point matching and pose estimation on the static feature points, relocating the camera pose through a relocalization thread, determining a key frame of the dynamic scene, and constructing a dense global map containing semantic tag information according to the semantic tag information of the static feature points and the key frame for completing background restoration.
10. The system of claim 9, wherein in the semantic segmentation module, the semantic segmentation network employs a lightweight network MobileNetV2 as a backbone network.
CN202311411809.9A 2023-10-27 2023-10-27 Semantic visual SLAM method and system based on semantic segmentation and optical flow Pending CN117710806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311411809.9A CN117710806A (en) 2023-10-27 2023-10-27 Semantic visual SLAM method and system based on semantic segmentation and optical flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311411809.9A CN117710806A (en) 2023-10-27 2023-10-27 Semantic visual SLAM method and system based on semantic segmentation and optical flow

Publications (1)

Publication Number Publication Date
CN117710806A 2024-03-15

Family

ID=90159494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311411809.9A Pending CN117710806A (en) 2023-10-27 2023-10-27 Semantic visual SLAM method and system based on semantic segmentation and optical flow

Country Status (1)

Country Link
CN (1) CN117710806A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118155175A (en) * 2024-04-22 2024-06-07 神鳍科技(上海)有限公司 Dynamic scene reconstruction method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination