CN113298904A - Monocular vision-based positioning and map construction method

Monocular vision-based positioning and map construction method

Info

Publication number
CN113298904A
CN113298904A (application CN202110591607.1A)
Authority
CN
China
Prior art keywords
map
frame
key frame
dynamic
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110591607.1A
Other languages
Chinese (zh)
Other versions
CN113298904B (en)
Inventor
齐咏生
陈培亮
刘利强
李永亭
董朝铁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202110591607.1A priority Critical patent/CN113298904B/en
Publication of CN113298904A publication Critical patent/CN113298904A/en
Application granted granted Critical
Publication of CN113298904B publication Critical patent/CN113298904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06T 5/00 - Image enhancement or restoration
    • G06T 5/77 - Retouching; Inpainting; Scratch removal
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/48 - Matching video sequences
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a positioning and map construction method based on monocular vision, which comprises the following steps: (1) processing the video frames used for map creation with a Mask R-CNN neural network and segmenting the prior dynamic targets in the environment to obtain segmented image frames; (2) positioning the segmented image frames in the map using a low-cost tracking module; (3) tracking, detecting and positioning the segmented image frames processed in the preceding steps with a multi-view geometry module; (4) performing background restoration based on time-weighted filtering on the portions of the map background occluded by dynamic targets; (5) acquiring the maps established in steps (1) to (4); when tracking failure occurs, adaptively generating a new local map and, when a loop closure is detected, fusing it with the previously established map, thereby realizing a multi-map construction thread. The method can effectively extract various dynamic targets in the environment.

Description

Monocular vision-based positioning and map construction method
Technical Field
The invention relates to the field of monocular visual positioning and map building, in particular to a monocular visual positioning and map building method.
Background
The development of intelligent technology has pushed intelligent mobile robot technology to a new stage, and simultaneous localization and mapping (SLAM), as a basic capability of intelligent mobile robots, has become a hot issue in robotics research. Before 2000, mobile-robot SLAM algorithms were mainly realized with lidar; after 2000, research gradually turned to SLAM algorithms based on machine vision.
Three main types of cameras are used in visual SLAM algorithms: monocular cameras, binocular (stereo) cameras, and RGB-D cameras. The monocular camera has the advantages of low price, low power consumption, small volume, convenient installation, a large amount of acquired information, and suitability for large-scale scenes, so monocular visual SLAM (V-SLAM) algorithms are widely applied. To date, several classical monocular V-SLAM algorithms have been developed, such as MonoSLAM, PTAM, and ORB-SLAM.
However, traditional SLAM is based on lidar, and a lidar-based SLAM algorithm can only establish a two-dimensional planar map: the map information is incomplete and is limited by various environmental factors. Various visual simultaneous localization and mapping (V-SLAM) algorithms have therefore emerged over the last decade, but traditional V-SLAM algorithms still have several defects; for example, they cannot continue mapping after tracking is lost, and they cannot build a correct environment map in a dynamic environment. As a result, traditional V-SLAM algorithms have difficulty achieving simultaneous localization and mapping for a robot in complex and changeable real scenes. Traditional algorithms suffer from two main problems: (1) a real map of the environment cannot be established in a dynamic environment; (2) the V-SLAM algorithm cannot continue mapping after tracking is lost (a situation similar to the "kidnapped robot" problem).
Disclosure of Invention
The invention aims to provide a positioning and map construction method based on monocular vision that addresses the two problems in the prior art. It detects various dynamic targets in the environment by combining deep learning with multi-view geometry: a Mask R-CNN neural network detects the prior dynamic targets in the environment, and on that basis multi-view geometry detects the various randomly moving dynamic targets. Dynamic targets are neither tracked nor mapped during tracking and map construction, which overcomes the various influences of dynamic objects on the algorithm. Then, a background restoration algorithm based on time-weighted filtering synthesizes an image frame for repairing the background map and applies smoothing filtering, so that the background occluded by dynamic objects is restored. Finally, a multi-map construction thread is designed using the idea of multi-map construction, which solves the problem that the traditional V-SLAM algorithm cannot continue tracking and mapping after tracking is lost. Compared with traditional V-SLAM algorithms, the method has stronger robustness, builds a more complete map, adapts better to dynamic environments, and therefore has better application value.
The invention provides a monocular vision-based positioning and map building method, which comprises the following steps:
(1) processing the video frames used for map creation with a Mask R-CNN neural network and segmenting the prior dynamic targets in the environment to obtain segmented image frames;
(2) positioning the segmented image frames in the map using a low-cost tracking module;
(3) tracking, detecting and positioning the segmented image frames processed in the preceding steps with a multi-view geometry module;
(4) performing background restoration based on time-weighted filtering on the portions of the map background occluded by dynamic targets;
(5) acquiring the maps established in steps (1) to (4); when tracking failure occurs, adaptively generating a new local map and, when a loop closure is detected, fusing it with the previously established map, thereby realizing a multi-map construction thread.
Firstly, the monocular vision-based positioning and map construction method introduces a dynamic-object detection mechanism that combines Mask R-CNN based deep learning with multi-view geometry; the two methods are combined because each has complementary advantages and handles different target situations. Deep learning has very good detection accuracy for prior dynamic targets, but its detection rate is lower for accidental or unlearned dynamic targets. For example, if a person moves while carrying a book, the person is a prior dynamic target but the book is not, and the motion of the book is difficult to detect through deep learning alone. Multi-view geometry makes up for this shortcoming: because it computes pose from spatial geometric constraints, it is sensitive to moving targets. However, multi-view geometry has its own drawback: it can hardly detect a dynamic object that changes slowly or is temporarily stationary, such as a person who is momentarily standing still.
Secondly, aiming at the problem of background restoration in map construction, the invention provides a multi-frame fusion algorithm based on time-weighted filtering, which is used to restore the map background in the portions occluded by dynamic objects.
Finally, the invention introduces the idea of multi-map construction. Relocalization in a conventional V-SLAM algorithm (e.g., the ORB-SLAM2 algorithm) is a greedy search that matches the current frame against all previous key frames, which is time-consuming, labour-intensive, and prone to falling into endless loops. The invention therefore replaces the relocalization step of the traditional algorithm with local multi-map construction: when tracking is lost, a new local map is created directly, so that tracking and map construction can continue after the loss.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a dynamic detection and background repair process algorithm in the present invention;
FIG. 2 is a schematic diagram of a static object and dynamic object detection algorithm in the present invention (left diagram: static object; right diagram: dynamic object);
FIG. 3 is a diagram of a background repair algorithm in accordance with the present invention;
FIG. 4 is a schematic diagram of a multi-map construction thread algorithm in the present invention;
FIG. 5 shows the result of running the ORB-SLAM2 algorithm on the fr3-sitting-xyz sequence;
FIG. 6 shows the result of running the ORBSLAMM algorithm on the fr3-sitting-xyz sequence;
FIG. 7 shows the result of running the DynaSLAM algorithm on the fr3-sitting-xyz sequence;
FIG. 8 shows the result of running the DE-SLAMM algorithm of the present invention on the fr3-sitting-xyz sequence;
FIG. 9 shows the result of running the ORB-SLAM2 algorithm on the fr3-walking-xyz sequence;
FIG. 10 shows the result of running the ORBSLAMM algorithm on the fr3-walking-xyz sequence;
FIG. 11 shows the result of running the DynaSLAM algorithm on the fr3-walking-xyz sequence;
FIG. 12 shows the result of running the DE-SLAMM algorithm of the present invention on the fr3-walking-xyz sequence;
FIG. 13 shows the result of running the ORB-SLAM2 algorithm on the fr1-floor sequence;
FIG. 14 shows the result of running the DynaSLAM algorithm on the fr1-floor sequence;
FIG. 15 shows the result of running the ORBSLAMM algorithm on the fr1-floor sequence;
FIG. 16 shows the result of running the DE-SLAMM algorithm of the present invention on the fr1-floor sequence;
FIG. 17 is a flowchart illustrating the monocular visual positioning and mapping method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The monocular vision-based positioning and map building method specifically comprises the following steps:
A. Dynamic target detection and background repair thread:
the dynamic detection and background repair process mainly comprises 4 parts: a Mask R-CNN neural network module, a low-cost tracking module, a multi-view geometry module and a background restoration module, as shown in FIG. 1.
1) Dynamic target detection
As shown in fig. 1, a Mask R-CNN neural network is used to process a video frame to obtain segmented frame 1 and segmented frame 2, in which the prior dynamic targets have been segmented out.
However, the prior dynamic targets cannot cover all dynamic targets, so a low-cost tracking module, obtained by simplifying the tracking thread, is needed. The low-cost tracking module positions the camera in the scene map that has already been created; map points generated by the local mapping thread are re-projected from the map scene into the segmented image frame, feature points are searched in the image frame, feature points in static areas are retained, feature points in dynamic areas are deleted, and the image frame is then passed to the multi-view geometry module.
The low-cost tracking module is obtained by simplifying the tracking thread; it positions the camera in the created scene map and lays the groundwork for detecting potentially dynamic objects (such as the chair). First, the image frame processed by Mask R-CNN is acquired and the camera is positioned in the established map scene; second, map points generated by the local mapping thread are re-projected from the map scene into the segmented image frame, feature points are searched in the image frame, feature points in static areas are retained and feature points in dynamic areas are deleted; finally, the image frame is passed to the multi-view geometry module. A minimal code sketch of this prior-dynamic segmentation and feature-point filtering is given below.
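The following is a minimal sketch, not the patented implementation, of how the Mask R-CNN pass and the feature-point filtering of the low-cost tracking module could be organized. It assumes a pretrained torchvision Mask R-CNN and OpenCV ORB features; the prior-dynamic class list (only the COCO "person" class here) and the score threshold are illustrative assumptions.

    import cv2
    import numpy as np
    import torch
    import torchvision

    # Illustrative set of prior dynamic COCO classes (1 = person); extend as needed.
    PRIOR_DYNAMIC_CLASSES = {1}

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    def segment_prior_dynamic(frame_bgr, score_thresh=0.7):
        """Return a binary mask that is 1 on pixels belonging to prior dynamic targets."""
        rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
        with torch.no_grad():
            out = model([tensor])[0]
        dyn_mask = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
        for label, score, mask in zip(out["labels"], out["scores"], out["masks"]):
            if int(label) in PRIOR_DYNAMIC_CLASSES and float(score) > score_thresh:
                dyn_mask |= (mask[0].numpy() > 0.5).astype(np.uint8)
        return dyn_mask

    def keep_static_keypoints(frame_bgr, dyn_mask):
        """Extract ORB keypoints and drop those that fall inside the dynamic mask."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        orb = cv2.ORB_create(1000)
        kps, desc = orb.detectAndCompute(gray, None)
        keep = [i for i, kp in enumerate(kps)
                if dyn_mask[int(kp.pt[1]), int(kp.pt[0])] == 0]
        return [kps[i] for i in keep], (None if desc is None else desc[keep])

Only the surviving static-area keypoints would then be handed to the tracking thread and, later, to the multi-view geometry module, mirroring the description above.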
Randomly moving objects are then detected with the multi-view geometry module. The specific steps are as follows (a minimal code sketch of this per-point check is given after step c) below):
a) for the input frame, the key frames with the highest degree of overlap are selected; to fully account for the distance between the new frame and each key frame as well as the rotation between them, a threshold is set on the number of overlapping key frames (set to 5 in the present invention).
b) Calculating, by triangulation, the depth z_proj at which the pixel point x of the previous frame projects to the pixel point x' of the current frame, according to formula (1); the formula uses the antisymmetric (skew-symmetric) matrix of x' together with the relative rotation and translation between the two frames, and z_proj is obtained by solving the resulting linear triangulation constraint.
The parallax angle α between the back-projections of x and x' is then calculated according to formula (2), as the angle between the back-projection vectors of x and x', obtained from the normalized inner product of the two vectors.
It is then judged whether the angle α is larger than a set threshold β (30° in this example). If the threshold β is too large, weakly moving objects may not be detected; if it is too small, stationary objects may be misdetected as moving. If α is larger than β, the key point is considered possibly occluded and is ignored. However, a static point can sometimes also exceed β, so a limiting condition is added: first the depth z' of the current-frame pixel point x' is obtained, then the reprojected depth z_proj is calculated, and finally the difference Δz between z_proj and z' is evaluated,

Δz = z_proj - z' (3)

where z_proj is the reprojected depth and z' is the depth of the current-frame pixel point. If Δz exceeds the threshold τ_z, the pixel point x' is considered to belong to a dynamic object; otherwise it is considered static. Therefore, whenever Δz > τ_z holds, the pixel point is regarded as a dynamic point and is subsequently ignored. Fig. 2 is a schematic diagram illustrating the detection of static and dynamic objects.
c) After the dynamic objects have been correctly judged, the feature points contained in them are removed to generate a segmented frame, which is passed to the tracking thread for tracking.
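A minimal sketch of the per-point check in steps b) and c) follows. It assumes calibrated, normalized homogeneous image coordinates, a known relative pose (R, t) between the key frame and the current frame, and an independently measured depth z' for x' (for a monocular system, typically obtained by triangulation against another view). The thresholds β and τ_z, the linear triangulation formula and the way the two tests are combined follow the description above, but they are illustrative assumptions rather than the exact patented formulas.

    import numpy as np

    def skew(v):
        """Antisymmetric (skew-symmetric) matrix S(v) such that S(v) @ w = v x w."""
        return np.array([[0.0, -v[2], v[1]],
                         [v[2], 0.0, -v[0]],
                         [-v[1], v[0], 0.0]])

    def triangulate_depth(x, x_prime, R, t):
        """Depth z_proj of x seen in the current frame, from S(x')(z R x + t) = 0,
        solved in the least-squares sense (stand-in for formula (1))."""
        A = skew(x_prime) @ (R @ x)   # coefficient of the unknown depth z
        c = skew(x_prime) @ t         # constant term
        return -float(A @ c) / float(A @ A)

    def parallax_angle_deg(x, x_prime, R):
        """Angle between the back-projection rays of x and x' (formula (2))."""
        v1 = R @ x                    # ray of x expressed in the current frame
        v2 = np.asarray(x_prime, dtype=float)
        cos_a = (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

    def is_dynamic(x, x_prime, R, t, z_meas, beta_deg=30.0, tau_z=0.1):
        """Flag x' as dynamic: a large parallax angle marks it as a candidate,
        and the depth difference dz = z_proj - z' (formula (3)) confirms it."""
        if parallax_angle_deg(x, x_prime, R) <= beta_deg:
            return False
        z_proj = triangulate_depth(x, x_prime, R, t)
        return (z_proj - z_meas) > tau_z

Points flagged by is_dynamic would simply be dropped before the segmented frame is handed back to the tracking thread, mirroring step c).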
2) Background repair
Fig. 3 shows the multi-frame fusion background repair algorithm based on time-weighted filtering. Starting from the current moment, the key frame images of the previous n moments are backtracked to restore the background, and the fusion weights of the key frames differ with time: the closer a key frame is to the current frame, the larger its weight. The synthesized key frame is given by formula (4):
formula (4) expresses the key frame used for background restoration as the time-weighted sum of the n backtracked key frames, with the weight of each key frame growing as its time approaches that of the current key frame; in the formula, KFS_i denotes the key frame at time t_i and KFS_c denotes the current key frame.
In this way, n key frames are used to synthesize one key frame for background repair. A small number of blank parts of the synthesized key frame may nevertheless remain unrepaired; in that case the pixels of the synthesized key frame are smoothed to repair these remaining parts. The calculation is as follows:
if the pixels in rows i1 to i2 and columns j1 to j2 are not repaired, a smoothing threshold k is set and smoothing filtering is applied according to formula (5): each unrepaired pixel value u_{i,j} (row i, column j of the image before restoration) is replaced by the smoothed value u'_{i,j} (row i, column j after restoration) computed from the neighbouring repaired pixels. In this way image restoration is achieved; finally, the restored image is used as a key frame for mapping, so that the map background is repaired. A brief code sketch of this fusion and smoothing is given below.
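A minimal sketch of the time-weighted multi-frame fusion and the smoothing of still-unrepaired pixels follows. It assumes grayscale key frames already warped into the current viewpoint, with a validity mask marking pixels not occluded by dynamic targets; the linearly increasing weights and the square averaging window controlled by k are illustrative choices standing in for formulas (4) and (5), not the exact patented expressions.

    import numpy as np

    def fuse_background(keyframes, valid_masks):
        """Time-weighted fusion of n aligned grayscale key frames (oldest first).

        keyframes:   list of HxW float arrays warped into the current view.
        valid_masks: list of HxW bool arrays, True where the pixel is observed
                     (i.e. not occluded by a dynamic target).
        """
        n = len(keyframes)
        weights = np.arange(1, n + 1, dtype=float)      # newer frames weigh more
        acc = np.zeros_like(keyframes[0], dtype=float)
        wsum = np.zeros_like(keyframes[0], dtype=float)
        for kf, m, w in zip(keyframes, valid_masks, weights):
            acc += w * kf * m
            wsum += w * m
        fused = np.where(wsum > 0, acc / np.maximum(wsum, 1e-9), 0.0)
        unrepaired = wsum == 0                          # never-observed background
        return fused, unrepaired

    def smooth_unrepaired(fused, unrepaired, k=2):
        """Fill remaining holes with the mean of repaired pixels in a window of
        half-size k (the smoothing threshold of formula (5), used illustratively)."""
        out = fused.copy()
        h, w = unrepaired.shape
        for i, j in zip(*np.nonzero(unrepaired)):
            i0, i1 = max(0, i - k), min(h, i + k + 1)
            j0, j1 = max(0, j - k), min(w, j + k + 1)
            patch = fused[i0:i1, j0:j1]
            known = ~unrepaired[i0:i1, j0:j1]
            if known.any():
                out[i, j] = patch[known].mean()
        return out

The fused and smoothed frame would then be used as the key frame that rebuilds the occluded portion of the map background.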
B. Multiple map construction threads:
the multi-map construction thread has the main functions of storing a plurality of maps established by the local construction thread and a local key frame database thereof, detecting whether a loop phenomenon exists between a current frame and a previously stored map or not, performing map fusion if a loop exists, and optimizing the camera pose at the same time, as shown in fig. 4. The specific workflow of the multi-map construction thread is as follows:
(1) When the algorithm starts to run, the tracking thread creates the first map M_0 and its local key frame database KFS_0 and passes them into the multi-map construction thread; the tracking thread, the local mapping thread and the loop detection thread then track and map on M_0. As long as the tracking thread is not lost, the multi-map construction thread remains idle.
(2) After the n-th tracking loss, the tracking thread creates a new map M_n and its key frame database KFS_n and passes them to the global map M and the global key frame database in the multi-map construction thread.
(3) The tracking thread attempts to reinitialize; once initialization succeeds, every thread is notified to transfer the tracking and mapping tasks to the new local map M_n. At this moment, the multi-map construction thread scans the key frames stored in the local key frame database KFS_n of the new map and matches them against the key frames stored in the previous global key frame database.
(4) The matching method is similar to the loop detection process: the multi-map construction thread traverses all previous local key frame databases (KFS_0 to KFS_(n-1)) and uses formula (6), which takes the minimum of the similarity scores between the current key frame K_c and all previously stored key frames, to query whether a previous map contains a key frame matching K_c.
(5) For each key frame K_j in a local map M_i ∈ [M_0, ..., M_(n-1)], if K_j has more than fifteen matching points with K_c, a solver computes the similarity transformation between them, or random sample consensus (RANSAC) iterations are performed for each K_j until a K_j with enough matching points is found or all candidate frames fail. If a similarity transformation can be returned after the RANSAC iterations, it is optimized; if enough matching points remain after optimization, K_j is regarded as a loop key frame. All map points seen in K_j and its adjacent frames are then detected and re-projected in K_c, and the computed similarity transformation is used to search for more matched frames. If the number of corresponding points over all matched frames exceeds a threshold, the frame is considered a loop.
(6) The multi-map construction thread then fuses the map in which the loop was detected with the current map, optimizes the camera pose, and finally generates a new global map M'. The map fusion method is as follows: the multi-map construction thread first calculates the similarity transformation matrix S_cw between K_c and K_j, and then uses S_cw to connect the local map M_n and the loop-generating map M_i through loop fusion (the same loop fusion as in loop detection). If the two maps are being fused for the first time, M_n is retrieved; otherwise only K_c is retrieved. The adjacent frames and map points of K_c are then corrected with S_cw, and the pose of each retrieved key frame is converted into coordinates in the coordinate system of map M_i by the following equations:

T_ic = T_iw * T_wc (7)

T_corr = T_ic * T_wc (8)

where T_iw is the pose of the retrieved key frame before correction, T_wc is the inverse pose of K_c before correction, and T_corr is the corrected pose of the retrieved key frame in the coordinate system of M_i.
In summary, the map points of each retrieved key frame and its neighbouring frames are corrected into coordinates in the coordinate system of map M_i, and the map points of K_j and its neighbouring frames are projected into K_c and its neighbouring frames, thus completing the fusion between the maps. The overall workflow is shown in detail in fig. 17. A brief code sketch of the pose correction and the surrounding multi-map bookkeeping is given below.
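A minimal sketch of the key-frame pose correction of equations (7) and (8) and of the surrounding multi-map bookkeeping follows. It assumes 4x4 homogeneous SE(3) pose matrices; the map and key-frame containers, the match counter, the sim(3) solver and the similarity score function are named placeholders rather than the actual data structures of the algorithm.

    import numpy as np

    def correct_keyframe_pose(T_iw, T_wc):
        """Equations (7) and (8): express a retrieved key-frame pose in the
        coordinate system of the loop map M_i.

        T_iw: pose of the retrieved key frame before correction.
        T_wc: inverse pose of K_c before correction.
        """
        T_ic = T_iw @ T_wc        # equation (7)
        T_corr = T_ic @ T_wc      # equation (8), as given in the description
        return T_corr

    class MultiMapManager:
        """Skeleton of the multi-map construction thread (illustrative only)."""

        def __init__(self):
            self.maps = []        # M_0 ... M_n, each with its key-frame database

        def on_tracking_lost(self, new_map):
            # The tracking thread creates a new local map M_n after each loss.
            self.maps.append(new_map)

        def min_similarity_score(self, K_c, keyframes, score):
            # Formula (6): minimum similarity between K_c and the stored key frames.
            return min(score(K_c, kf) for kf in keyframes)

        def try_fuse(self, K_c, min_matches=15):
            # Scan earlier maps for a loop key frame K_j matching K_c, then fuse.
            current_map = self.maps[-1]
            for M_i in self.maps[:-1]:
                for K_j in M_i.keyframes:                  # placeholder container
                    if M_i.count_matches(K_c, K_j) <= min_matches:
                        continue
                    S_cw = M_i.solve_sim3(K_c, K_j)        # RANSAC similarity transform
                    if S_cw is not None:
                        self.fuse(current_map, M_i, K_c, K_j, S_cw)
                        return True
            return False

        def fuse(self, M_n, M_i, K_c, K_j, S_cw):
            # Re-express every key frame of the new local map in M_i's coordinate
            # system via equations (7) and (8); the sim(3) correction of K_c and
            # the merging of map points are left to the placeholder absorb() call.
            T_wc = np.linalg.inv(K_c.pose)     # inverse pose of K_c before correction
            for kf in M_n.keyframes:
                kf.pose = correct_keyframe_pose(kf.pose, T_wc)
            M_i.absorb(M_n, K_c, K_j, S_cw)    # placeholder: merge the two maps

Everything on MultiMapManager other than the pose arithmetic is bookkeeping around containers that the real system would provide; only equations (7) and (8) are taken directly from the description above.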
In addition, in order to verify the effectiveness of the algorithm, it was tested on several collected typical video sequences, covering initialization and tracking speed, multi-map construction after tracking loss, and detection effectiveness in dynamic environments. Finally, the DE-SLAMM algorithm proposed by the invention was compared in performance with typical V-SLAM algorithms (the ORB-SLAM2 algorithm, the ORBSLAMM algorithm, and the DynaSLAM algorithm). The hardware environment of the experimental platform was an Intel Core i7-10750 CPU @ 2.6 GHz x 6 cores with 16 GB RAM, and the experimental results show that the algorithm runs in real time at the frame rate of each sequence.
A. Dynamic target detection and background repair performance testing:
in order to test the dynamic target detection performance and the background repair performance of the algorithm, experiments are carried out by utilizing fr3-sitting-xyz and fr3-walking-xyz video sequences in the collected data set, and the experiments are compared with an ORB-SLAM2 algorithm, an ORBSLAMM algorithm and a dynaSLAM algorithm.
1) Dynamic target detection performance testing
In the fr3-sitting-xyz video sequence the camera moves in the X, Y and Z directions, and the person in the sequence only moves on a chair with small motion amplitude, so the environment of the fr3-sitting-xyz video sequence is a weakly dynamic environment.
In this weakly dynamic environment, a person only sits on a chair with small motion amplitude while the camera moves in the X, Y and Z directions. The experimental results of the ORB-SLAM2 and ORBSLAMM algorithms on the fr3-sitting-xyz video sequence are shown in FIGS. 5 and 6: when these algorithms extract feature points for tracking and mapping, points on the dynamic objects (the person and the chair) are extracted and the dynamic objects are built into the map as part of it, so in a weakly dynamic environment the ORB-SLAM2 and ORBSLAMM algorithms cannot segment the dynamic objects in the environment. The experimental result of the DynaSLAM algorithm on the fr3-sitting-xyz video sequence is shown in fig. 7: DynaSLAM can detect and segment the prior dynamic objects (the person) in the video and does not use the feature points extracted on them for tracking and mapping, but it cannot detect non-prior dynamic targets (the moving chair) and sometimes misdetects static objects as dynamic ones, so in a weakly dynamic environment DynaSLAM cannot segment all dynamic targets in the environment. The algorithm of the invention uses the Mask R-CNN neural network framework to detect and segment the prior dynamic target (the person) in the environment, then uses multi-view geometry to detect and segment the chair while it is being moved, and does not extract feature points on the segmented dynamic targets (the person and the moving chair) for tracking and mapping, as shown in fig. 8. Therefore, in a weakly dynamic environment the algorithm can segment almost all dynamic targets, and its dynamic target detection performance is superior to that of the other three V-SLAM algorithms in this example.
2) Background repair performance testing
In the fr3-walking-xyz video sequence the camera moves in the X, Y and Z directions, and the person in the sequence walks around with large motion amplitude, so the environment of the fr3-walking-xyz video sequence is a strongly dynamic environment.
In this strongly dynamic environment, a person walks back and forth with large motion amplitude while the camera moves in the X, Y and Z directions. The experimental results of the ORB-SLAM2 and ORBSLAMM algorithms on the fr3-walking-xyz video sequence are shown in FIGS. 9 and 10: fewer feature points are extracted on the dynamic objects (the person and the chair), yet the extracted dynamic-object feature points are still used for tracking and mapping, so in a strongly dynamic environment the ORB-SLAM2 algorithm cannot segment the dynamic targets in the environment and cannot establish a true map of the environment. The experimental result of the DynaSLAM algorithm on the fr3-walking-xyz video sequence is shown in fig. 11: it can detect and segment the prior dynamic target (the person) and does not track or map the feature points extracted on it, but it cannot detect non-prior dynamic targets (the moving chair) and sometimes misdetects static objects as dynamic ones, so in a strongly dynamic environment DynaSLAM cannot build a true map of the environment. The algorithm of the invention detects and segments the prior dynamic target (the person) with the Mask R-CNN neural network framework, detects and segments the chair while it is being moved with multi-view geometry, and does not track or map the feature points extracted on the segmented dynamic targets (the person and the moving chair), as shown in fig. 12. In addition, the algorithm of the invention repairs the background occluded by dynamic targets in the key frames with the multi-frame fusion background repair algorithm, so the sparse map it establishes clearly contains more map points than the sparse map established in fig. 11. In conclusion, in a strongly dynamic environment the algorithm can segment almost all dynamic targets in the environment and can also repair the background map occluded by the dynamic targets, so it is superior to the other three V-SLAM algorithms.
B. Multi-map build performance testing
In the test of the multi-map construction performance of DE-SLAMM, in order to better show that the DE-SLAMM algorithm can continue tracking and mapping after tracking is lost, experiments were carried out on the typical tracking-loss test video sequence fr1-floor, and performance was again compared with the three algorithms above.
The fr1-floor video sequence is captured by a camera moving quickly in a room; this experiment mainly tests whether the algorithm can continue tracking and mapping when a tracking problem is encountered.
The experimental result of the ORB-SLAM2 algorithm is shown in fig. 13, where the blue rectangular boxes represent the key frames of the tracking process and the red and black points represent the established map points. After initialization, ORB-SLAM2 extracts feature points in the environment and starts tracking and mapping; when tracking is lost, it enters relocalization mode, but it cannot return to a previously visited position to relocalize, so subsequent tracking and mapping cannot continue.
As shown in fig. 14, when the DynaSLAM algorithm performs tracking and mapping, a static object in the environment is erroneously detected as a dynamic object, which causes tracking loss; after the loss the algorithm enters relocalization mode and cannot continue tracking and mapping.
Fig. 15 shows the experimental result of the ORBSLAMM algorithm: after initialization it extracts feature points in the environment and starts tracking and mapping; after map 1 is established the algorithm loses tracking, then reinitializes, extracts feature points in the environment again to continue tracking and mapping, and establishes map 2.
The experimental result of the DE-SLAMM algorithm of the invention is shown in fig. 16. After initialization the algorithm establishes a local map M_1; tracking loss occasionally occurs, but the algorithm quickly reinitializes, establishes a new local map M_2 and continues tracking and mapping, and when a loop frame is detected the new local map M_2 and map M_1 are fused. The algorithm can therefore continue tracking and mapping when tracking loss occurs, which is superior to the ORB-SLAM2 and DynaSLAM algorithms, and it can also fuse the two established local maps into a complete global map, so its performance in the presence of tracking loss is superior to that of the first three algorithms.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims (7)

1. A positioning and map construction method based on monocular vision is characterized by comprising the following steps:
(1) processing the video frames used for map creation with a Mask R-CNN neural network and segmenting the prior dynamic targets in the environment to obtain segmented image frames;
(2) positioning the segmented image frames in the map using a low-cost tracking module;
(3) tracking, detecting and positioning the segmented image frames processed in the preceding steps with a multi-view geometry module;
(4) performing background restoration based on time-weighted filtering on the portions of the map background occluded by dynamic targets;
(5) acquiring the maps established in steps (1) to (4); when tracking failure occurs, adaptively generating a new local map and, when a loop closure is detected, fusing it with the previously established map, thereby realizing a multi-map construction thread.
2. The monocular vision based positioning and mapping method according to claim 1, wherein in the step (3), the method using the multi-view geometry module comprises:
s11, firstly, calculating the depth of each map point projected from the previous frame to the pixel point of the current frame;
s12, calculating the parallax angle of the back projection of the pixel point of the previous frame and the current frame, comparing the viewing angle with a set threshold, if the viewing angle is larger than the threshold, determining the pixel point as a dynamic point, and meanwhile, judging by adopting the reprojection error of the pixel point of the previous frame and the current frame, if the error is larger than the set threshold, determining the pixel point as a dynamic point;
and S13, removing all detected dynamic points to generate a new segmented image frame for tracking detection and positioning.
3. The monocular vision based positioning and mapping method of claim 2, wherein the method of detecting the dynamic point comprises:
calculating, by triangulation, the depth z_proj at which the pixel point x of the previous frame projects to the pixel point x' of the current frame according to formula (1), in which the antisymmetric matrix of x' is used together with the relative pose between the two frames;
calculating the parallax angle α between the back-projections of x and x' according to formula (2), as the angle between the back-projection vectors of x and x';
judging whether the angle α is larger than a set threshold β; if so, the key point is considered occluded and is ignored;
wherein, for the case in which a static point exceeds β, the following limiting condition is added: first the depth z' of the current-frame pixel point x' is obtained, then the reprojected depth z_proj is calculated, and finally the difference Δz between z_proj and z' is evaluated,

Δz = z_proj - z' (3)

where z_proj is the reprojected depth and z' is the depth of the current-frame pixel point;
if Δz exceeds the threshold τ_z, the pixel point x' is considered to belong to a dynamic point; otherwise it is considered static.
4. The monocular vision based localization and mapping method according to claim 1, wherein in the step (4), the method for background restoration based on temporal weighted filtering comprises:
backtracking the key frame images of the previous n moments to repair the background, with the fusion weights of the key frames differing with time such that the weight of a key frame closer to the current frame is larger, so that the background repair is performed as follows:
according to formula (4), the key frame used for background repair is synthesized as the time-weighted sum of the n backtracked key frames, where KFS_i denotes the key frame at time t_i and KFS_c denotes the current key frame;
the above formula thus uses n key frames to synthesize one key frame for background repair.
5. The monocular vision-based localization and mapping method of claim 4, wherein when there are a few vacant parts of the synthesized key frame that are not repaired, the method of computing comprises:
when the pixels in rows i1 to i2 and columns j1 to j2 are not repaired, setting a smoothing threshold k and applying smoothing filtering according to formula (5), in which each unrepaired pixel value u_{i,j} (row i, column j of the image before restoration) is replaced by the smoothed value u'_{i,j} (row i, column j of the image after restoration) computed from the neighbouring repaired pixels.
6. The monocular vision based positioning and mapping method of claim 1, wherein in the step (5), the method of multiple mapping threads comprises:
s21, creating a first map M0And its local key frame database KFS0And the data is transmitted into a multi-map construction thread, and then a tracking thread, a local map construction thread and a loop detection thread are used for map M0Tracking and constructing a map, wherein the multi-map construction thread is in an idle state on the premise that the tracking thread is not lost;
s22, after the nth tracking thread is lost, the tracking thread creates a new map MnAnd key frame database KFSnAnd transmitting the data to a global map M and a global key frame database in a multi-map construction thread;
s23, the trace thread attempts to reinitialize, and after the initialization succeeds,the threads will be notified to transfer the task of tracking and mapping to the local map MnIn operation, at the moment, the multi-map construction thread scans the local key frame database KFS of the new mapnThe key frames stored in the database are matched with the key frames stored in the previous global key frame database;
s24, traversing all previous local key frame databases KFS0~KFSn-1And calculating the minimum similarity score between all the previous key frames and the current key frame by using the following formula (6) to inquire whether the current key frame K is similar to the previous key frame K in the previous mapcMatching key frames;
Figure FDA0003089482090000031
s25, for each Mi∈[M0~Mn-1]Keyframe K in a local mapjIf it is combined with KcHaving more than fifteen matching points, let the solver calculate the similarity transformation between them, or for each KjPerforming a random sampling consistency iteration until K with enough matching points is foundjOr all candidate frames fail; if a similarity transformation can be returned after the random sampling consistency iteration, the similarity transformation can be optimized, and if enough matching points still exist after the optimization, K isjIs considered to be a loop key frame, and all are at KjAnd the map points seen in its adjacent frames will all be at KcDetecting and reprojecting, searching more matched frames by using the calculated similarity transformation, and if the point number corresponding to all the matched frames exceeds a threshold value, considering the matched frames as a loop;
and S26, carrying out map fusion on the map detected to be looped back and the current map, optimizing the camera pose, and finally generating a new global map M'.
7. The monocular vision based positioning and mapping method of claim 6, wherein the map fusion method comprises:
first calculating the similarity transformation matrix S_cw between K_c and K_j, and then using S_cw to connect the local map M_n and the loop-generating map M_i through loop fusion; if the two maps are fused for the first time, M_n is retrieved, otherwise only K_c is retrieved; then the adjacent frames and map points of K_c are corrected with S_cw, and the pose of each retrieved key frame is converted into coordinates in the coordinate system of map M_i by the following equations:

T_ic = T_iw * T_wc (7)

T_corr = T_ic * T_wc (8)

where T_iw is the pose of the retrieved key frame before correction, T_wc is the inverse pose of K_c before correction, and T_corr is the corrected pose of the retrieved key frame in the coordinate system of M_i.
CN202110591607.1A 2021-05-28 2021-05-28 Positioning and map construction method based on monocular vision Active CN113298904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110591607.1A CN113298904B (en) 2021-05-28 2021-05-28 Positioning and map construction method based on monocular vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110591607.1A CN113298904B (en) 2021-05-28 2021-05-28 Positioning and map construction method based on monocular vision

Publications (2)

Publication Number Publication Date
CN113298904A true CN113298904A (en) 2021-08-24
CN113298904B CN113298904B (en) 2022-12-02

Family

ID=77325815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110591607.1A Active CN113298904B (en) 2021-05-28 2021-05-28 Positioning and map construction method based on monocular vision

Country Status (1)

Country Link
CN (1) CN113298904B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463909A (en) * 2014-11-28 2015-03-25 北京交通大学长三角研究院 Visual target tracking method based on credibility combination map model
CN109387204A (en) * 2018-09-26 2019-02-26 东北大学 The synchronous positioning of the mobile robot of dynamic environment and patterning process in faced chamber
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method
US10813715B1 (en) * 2019-10-16 2020-10-27 Nettelo Incorporated Single image mobile device human body scanning and 3D model creation and analysis
WO2021082771A1 (en) * 2019-10-29 2021-05-06 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Augmented reality 3d reconstruction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463909A (en) * 2014-11-28 2015-03-25 北京交通大学长三角研究院 Visual target tracking method based on credibility combination map model
CN109387204A (en) * 2018-09-26 2019-02-26 东北大学 The synchronous positioning of the mobile robot of dynamic environment and patterning process in faced chamber
US10813715B1 (en) * 2019-10-16 2020-10-27 Nettelo Incorporated Single image mobile device human body scanning and 3D model creation and analysis
WO2021082771A1 (en) * 2019-10-29 2021-05-06 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Augmented reality 3d reconstruction
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
齐咏生, 孙作慧, 李永亭, 刘利强: "Research on simultaneous localization and mapping of mobile robots based on ISRCDKF" (基于ISRCDKF的移动机器人同时定位与建图研究), 《农业机械学报》 (Transactions of the Chinese Society for Agricultural Machinery), 5 September 2019 (2019-09-05) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115526811A (en) * 2022-11-28 2022-12-27 电子科技大学中山学院 Adaptive vision SLAM method suitable for variable illumination environment

Also Published As

Publication number Publication date
CN113298904B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN108960211B (en) Multi-target human body posture detection method and system
CN107808111B (en) Method and apparatus for pedestrian detection and attitude estimation
CN111563442A (en) Slam method and system for fusing point cloud and camera image data based on laser radar
Boniardi et al. Robot localization in floor plans using a room layout edge extraction network
CN106952288B (en) Based on convolution feature and global search detect it is long when block robust tracking method
US6826292B1 (en) Method and apparatus for tracking moving objects in a sequence of two-dimensional images using a dynamic layered representation
Luo et al. Real-time dense monocular SLAM with online adapted depth prediction network
WO2017206005A1 (en) System for recognizing postures of multiple people employing optical flow detection and body part model
Concha et al. RGBDTAM: A cost-effective and accurate RGB-D tracking and mapping system
Sock et al. Multi-task deep networks for depth-based 6d object pose and joint registration in crowd scenarios
CN111739144A (en) Method and device for simultaneously positioning and mapping based on depth feature optical flow
CN111354022A (en) Target tracking method and system based on kernel correlation filtering
Lamarca et al. Camera tracking for slam in deformable maps
CN113298904B (en) Positioning and map construction method based on monocular vision
Wang et al. Effective multiple pedestrian tracking system in video surveillance with monocular stationary camera
Hu et al. Multiple maps for the feature-based monocular SLAM system
Dai et al. RGB‐D SLAM with moving object tracking in dynamic environments
Amer Voting-based simultaneous tracking of multiple video objects
Welponer et al. Monocular depth prediction in photogrammetric applications
CN115131407B (en) Robot target tracking method, device and equipment oriented to digital simulation environment
CN113570713B (en) Semantic map construction method and device for dynamic environment
Suttasupa et al. Plane detection for Kinect image sequences
Samdurkar et al. Overview of Object Detection and Tracking based on Block Matching Techniques.
CN108534797A (en) A kind of real-time high-precision visual odometry method
Wang et al. Stream query denoising for vectorized hd map construction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant