CN113298904A - Monocular vision-based positioning and map construction method

Monocular vision-based positioning and map construction method

Info

Publication number
CN113298904A
CN113298904A (application CN202110591607.1A)
Authority
CN
China
Prior art keywords
map
frame
key frame
dynamic
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110591607.1A
Other languages
Chinese (zh)
Other versions
CN113298904B (en)
Inventor
齐咏生
陈培亮
刘利强
李永亭
董朝铁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202110591607.1A priority Critical patent/CN113298904B/en
Publication of CN113298904A publication Critical patent/CN113298904A/en
Application granted granted Critical
Publication of CN113298904B publication Critical patent/CN113298904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06T 5/00 - Image enhancement or restoration
    • G06T 5/77 - Retouching; Inpainting; Scratch removal
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/48 - Matching video sequences
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a positioning and map construction method based on monocular vision, which comprises the following steps: (1) processing the video frames used for map creation with a Mask R-CNN neural network and segmenting the prior dynamic targets in the environment to obtain segmented image frames; (2) positioning the segmented image frames in the map using a low-cost tracking module; (3) tracking, detecting and positioning the segmented image frames processed in the preceding steps with a multi-view geometry module; (4) performing background restoration based on time-weighted filtering on the portions of the map background occluded by dynamic targets; (5) acquiring the maps established in steps (1) to (4); when tracking failure occurs, adaptively generating a new local map and, when a loop closure is detected, fusing it with the previously established map, thereby realizing a multi-map construction thread. The method can effectively extract various dynamic targets in the environment.

Description

Monocular vision-based positioning and map construction method
Technical Field
The invention relates to the field of monocular visual positioning and map building, in particular to a monocular visual positioning and map building method.
Background
The development of intelligent technology has pushed intelligent mobile robot technology to a new stage, and simultaneous localization and mapping (SLAM), as a basic capability of intelligent mobile robots, has become a hot issue in robotics research. Before 2000, mobile-robot SLAM algorithms were mainly realized with lidar; after 2000, research gradually turned to SLAM algorithms based on machine vision.
Three main types of cameras are used in visual SLAM algorithms: monocular cameras, binocular (stereo) cameras, and RGB-D cameras. The monocular camera has the advantages of low price, low power consumption, small volume, convenient installation, a large amount of acquired information, and suitability for large-scale scenes, so monocular visual SLAM (V-SLAM) algorithms are widely applied. To date, several classical monocular V-SLAM algorithms have been developed, such as MonoSLAM, PTAM, and ORB-SLAM.
However, traditional SLAM is based on lidar, and a lidar-based SLAM algorithm can only establish a two-dimensional planar map: the map information is incomplete and is limited by various environmental factors. Various visual simultaneous localization and mapping (V-SLAM) algorithms have therefore emerged over the last decade, but traditional V-SLAM algorithms still have several defects; for example, they cannot continue mapping after tracking is lost, and they cannot build a correct environment map in a dynamic environment. As a result, traditional V-SLAM algorithms have difficulty achieving simultaneous localization and mapping for a robot in complex and changeable real scenes. Traditional algorithms suffer from two main problems: (1) a real map of the environment cannot be established in a dynamic environment; (2) the V-SLAM algorithm cannot continue mapping after tracking is lost (a situation similar to the "kidnapped robot" problem).
Disclosure of Invention
The invention aims to provide a positioning and map construction method based on monocular vision that addresses the two problems in the prior art. It detects various dynamic targets in the environment by combining deep learning with multi-view geometry: a Mask R-CNN neural network detects the prior dynamic targets in the environment, and on that basis multi-view geometry detects the various randomly moving dynamic targets. Dynamic targets are neither tracked nor mapped during tracking and map construction, which overcomes the various influences of dynamic objects on the algorithm. Then, a background restoration algorithm based on time-weighted filtering synthesizes an image frame for repairing the background map and applies smoothing filtering, so that the background occluded by dynamic objects is restored. Finally, a multi-map construction thread is designed using the idea of multi-map construction, which solves the problem that the traditional V-SLAM algorithm cannot continue tracking and mapping after tracking is lost. Compared with traditional V-SLAM algorithms, the method has stronger robustness, builds a more complete map, adapts better to dynamic environments, and therefore has better application value.
The invention provides a monocular vision-based positioning and map building method, which comprises the following steps:
(1) processing the video frames used for map creation with a Mask R-CNN neural network and segmenting the prior dynamic targets in the environment to obtain segmented image frames;
(2) positioning the segmented image frames in the map using a low-cost tracking module;
(3) tracking, detecting and positioning the segmented image frames processed in the preceding steps with a multi-view geometry module;
(4) performing background restoration based on time-weighted filtering on the portions of the map background occluded by dynamic targets;
(5) acquiring the maps established in steps (1) to (4); when tracking failure occurs, adaptively generating a new local map and, when a loop closure is detected, fusing it with the previously established map, thereby realizing a multi-map construction thread.
Firstly, the monocular vision-based positioning and map construction method introduces a dynamic-object detection mechanism that combines Mask R-CNN based deep learning with multi-view geometry; the two methods are combined because each has complementary advantages and handles different target situations. Deep learning has very good detection accuracy for prior dynamic targets, but its detection rate is lower for accidental or unlearned dynamic targets. For example, if a person moves while carrying a book, the person is a prior dynamic target but the book is not, and the motion of the book is difficult to detect through deep learning alone. Multi-view geometry makes up for this shortcoming: because it computes pose from spatial geometric constraints, it is sensitive to moving targets. However, multi-view geometry has its own drawback: it can hardly detect a dynamic object that changes slowly or is temporarily stationary, such as a person who is momentarily standing still.
Secondly, aiming at the problem of background restoration in map construction, the invention provides a multi-frame fusion algorithm based on time-weighted filtering, which is used to restore the map background in the portions occluded by dynamic objects.
Finally, the invention introduces the idea of multi-map construction. Relocalization in a conventional V-SLAM algorithm (e.g., the ORB-SLAM2 algorithm) is a greedy search that matches the current frame against all previous key frames, which is time-consuming, labour-intensive, and prone to falling into endless loops. The invention therefore replaces the relocalization step of the traditional algorithm with local multi-map construction: when tracking is lost, a new local map is created directly, so that tracking and map construction can continue after the loss.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a dynamic detection and background repair process algorithm in the present invention;
FIG. 2 is a schematic diagram of a static object and dynamic object detection algorithm in the present invention (left diagram: static object; right diagram: dynamic object);
FIG. 3 is a diagram of a background repair algorithm in accordance with the present invention;
FIG. 4 is a schematic diagram of a multi-map construction thread algorithm in the present invention;
FIG. 5 shows the result of running the ORB-SLAM2 algorithm on the fr3-sitting-xyz sequence;
FIG. 6 shows the result of running the ORBSLAMM algorithm on the fr3-sitting-xyz sequence;
FIG. 7 shows the result of running the DynaSLAM algorithm on the fr3-sitting-xyz sequence;
FIG. 8 shows the result of running the DE-SLAMM algorithm of the present invention on the fr3-sitting-xyz sequence;
FIG. 9 shows the result of running the ORB-SLAM2 algorithm on the fr3-walking-xyz sequence;
FIG. 10 shows the result of running the ORBSLAMM algorithm on the fr3-walking-xyz sequence;
FIG. 11 shows the result of running the DynaSLAM algorithm on the fr3-walking-xyz sequence;
FIG. 12 shows the result of running the DE-SLAMM algorithm of the present invention on the fr3-walking-xyz sequence;
FIG. 13 shows the result of running the ORB-SLAM2 algorithm on the fr1-floor sequence;
FIG. 14 shows the result of running the DynaSLAM algorithm on the fr1-floor sequence;
FIG. 15 shows the result of running the ORBSLAMM algorithm on the fr1-floor sequence;
FIG. 16 shows the result of running the DE-SLAMM algorithm of the present invention on the fr1-floor sequence;
FIG. 17 is a flowchart illustrating the monocular visual positioning and mapping method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The monocular vision-based positioning and map building method specifically comprises the following steps:
A. Dynamic target detection and background repair thread:
the dynamic detection and background repair process mainly comprises 4 parts: a Mask R-CNN neural network module, a low-cost tracking module, a multi-view geometry module and a background restoration module, as shown in FIG. 1.
1) Dynamic target detection
As shown in fig. 1, a Mask R-CNN neural network is used to process a video frame to obtain segmented frame 1 and segmented frame 2, in which the prior dynamic targets have been segmented out.
However, the prior dynamic targets cannot cover all dynamic targets, so a low-cost tracking module, obtained by simplifying the tracking thread, is needed. The low-cost tracking module positions the camera in the scene map that has already been created; map points generated by the local mapping thread are re-projected from the map scene into the segmented image frame, feature points are searched in the image frame, feature points in static areas are retained, feature points in dynamic areas are deleted, and the image frame is then passed to the multi-view geometry module.
The low-cost tracking module is obtained by simplifying the tracking thread; it positions the camera in the created scene map and lays the groundwork for detecting potentially dynamic objects (such as the chair). First, the image frame processed by Mask R-CNN is acquired and the camera is positioned in the established map scene; second, map points generated by the local mapping thread are re-projected from the map scene into the segmented image frame, feature points are searched in the image frame, feature points in static areas are retained and feature points in dynamic areas are deleted; finally, the image frame is passed to the multi-view geometry module. A minimal code sketch of this prior-dynamic segmentation and feature-point filtering is given below.
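The following is a minimal sketch, not the patented implementation, of how the Mask R-CNN pass and the feature-point filtering of the low-cost tracking module could be organized. It assumes a pretrained torchvision Mask R-CNN and OpenCV ORB features; the prior-dynamic class list (only the COCO "person" class here) and the score threshold are illustrative assumptions.

    import cv2
    import numpy as np
    import torch
    import torchvision

    # Illustrative set of prior dynamic COCO classes (1 = person); extend as needed.
    PRIOR_DYNAMIC_CLASSES = {1}

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    def segment_prior_dynamic(frame_bgr, score_thresh=0.7):
        """Return a binary mask that is 1 on pixels belonging to prior dynamic targets."""
        rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
        with torch.no_grad():
            out = model([tensor])[0]
        dyn_mask = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
        for label, score, mask in zip(out["labels"], out["scores"], out["masks"]):
            if int(label) in PRIOR_DYNAMIC_CLASSES and float(score) > score_thresh:
                dyn_mask |= (mask[0].numpy() > 0.5).astype(np.uint8)
        return dyn_mask

    def keep_static_keypoints(frame_bgr, dyn_mask):
        """Extract ORB keypoints and drop those that fall inside the dynamic mask."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        orb = cv2.ORB_create(1000)
        kps, desc = orb.detectAndCompute(gray, None)
        keep = [i for i, kp in enumerate(kps)
                if dyn_mask[int(kp.pt[1]), int(kp.pt[0])] == 0]
        return [kps[i] for i in keep], (None if desc is None else desc[keep])

Only the surviving static-area keypoints would then be handed to the tracking thread and, later, to the multi-view geometry module, mirroring the description above.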
Randomly moving objects are then detected with the multi-view geometry module. The specific steps are as follows (a minimal code sketch of this per-point check is given after step c) below):
a) for the input frame, the key frames with the highest degree of overlap are selected; to fully account for the distance between the new frame and each key frame as well as the rotation between them, a threshold is set on the number of overlapping key frames (set to 5 in the present invention).
b) Calculating, by triangulation, the depth z_proj at which the pixel point x of the previous frame projects to the pixel point x' of the current frame, according to formula (1); the formula uses the antisymmetric (skew-symmetric) matrix of x' together with the relative rotation and translation between the two frames, and z_proj is obtained by solving the resulting linear triangulation constraint.
The parallax angle α between the back-projections of x and x' is then calculated according to formula (2), as the angle between the back-projection vectors of x and x', obtained from the normalized inner product of the two vectors.
It is then judged whether the angle α is larger than a set threshold β (30° in this example). If the threshold β is too large, weakly moving objects may not be detected; if it is too small, stationary objects may be misdetected as moving. If α is larger than β, the key point is considered possibly occluded and is ignored. However, a static point can sometimes also exceed β, so a limiting condition is added: first the depth z' of the current-frame pixel point x' is obtained, then the reprojected depth z_proj is calculated, and finally the difference Δz between z_proj and z' is evaluated,

Δz = z_proj - z' (3)

where z_proj is the reprojected depth and z' is the depth of the current-frame pixel point. If Δz exceeds the threshold τ_z, the pixel point x' is considered to belong to a dynamic object; otherwise it is considered static. Therefore, whenever Δz > τ_z holds, the pixel point is regarded as a dynamic point and is subsequently ignored. Fig. 2 is a schematic diagram illustrating the detection of static and dynamic objects.
c) After the dynamic objects have been correctly judged, the feature points contained in them are removed to generate a segmented frame, which is passed to the tracking thread for tracking.
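A minimal sketch of the per-point check in steps b) and c) follows. It assumes calibrated, normalized homogeneous image coordinates, a known relative pose (R, t) between the key frame and the current frame, and an independently measured depth z' for x' (for a monocular system, typically obtained by triangulation against another view). The thresholds β and τ_z, the linear triangulation formula and the way the two tests are combined follow the description above, but they are illustrative assumptions rather than the exact patented formulas.

    import numpy as np

    def skew(v):
        """Antisymmetric (skew-symmetric) matrix S(v) such that S(v) @ w = v x w."""
        return np.array([[0.0, -v[2], v[1]],
                         [v[2], 0.0, -v[0]],
                         [-v[1], v[0], 0.0]])

    def triangulate_depth(x, x_prime, R, t):
        """Depth z_proj of x seen in the current frame, from S(x')(z R x + t) = 0,
        solved in the least-squares sense (stand-in for formula (1))."""
        A = skew(x_prime) @ (R @ x)   # coefficient of the unknown depth z
        c = skew(x_prime) @ t         # constant term
        return -float(A @ c) / float(A @ A)

    def parallax_angle_deg(x, x_prime, R):
        """Angle between the back-projection rays of x and x' (formula (2))."""
        v1 = R @ x                    # ray of x expressed in the current frame
        v2 = np.asarray(x_prime, dtype=float)
        cos_a = (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

    def is_dynamic(x, x_prime, R, t, z_meas, beta_deg=30.0, tau_z=0.1):
        """Flag x' as dynamic: a large parallax angle marks it as a candidate,
        and the depth difference dz = z_proj - z' (formula (3)) confirms it."""
        if parallax_angle_deg(x, x_prime, R) <= beta_deg:
            return False
        z_proj = triangulate_depth(x, x_prime, R, t)
        return (z_proj - z_meas) > tau_z

Points flagged by is_dynamic would simply be dropped before the segmented frame is handed back to the tracking thread, mirroring step c).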
2) Background repair
Fig. 3 shows the multi-frame fusion background repair algorithm based on time-weighted filtering. Starting from the current moment, the key frame images of the previous n moments are backtracked to restore the background, and the fusion weights of the key frames differ with time: the closer a key frame is to the current frame, the larger its weight. The synthesized key frame is given by formula (4):
formula (4) expresses the key frame used for background restoration as the time-weighted sum of the n backtracked key frames, with the weight of each key frame growing as its time approaches that of the current key frame; in the formula, KFS_i denotes the key frame at time t_i and KFS_c denotes the current key frame.
In this way, n key frames are used to synthesize one key frame for background repair. A small number of blank parts of the synthesized key frame may nevertheless remain unrepaired; in that case the pixels of the synthesized key frame are smoothed to repair these remaining parts. The calculation is as follows:
if the pixels in rows i1 to i2 and columns j1 to j2 are not repaired, a smoothing threshold k is set and smoothing filtering is applied according to formula (5): each unrepaired pixel value u_{i,j} (row i, column j of the image before restoration) is replaced by the smoothed value u'_{i,j} (row i, column j after restoration) computed from the neighbouring repaired pixels. In this way image restoration is achieved; finally, the restored image is used as a key frame for mapping, so that the map background is repaired. A brief code sketch of this fusion and smoothing is given below.
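A minimal sketch of the time-weighted multi-frame fusion and the smoothing of still-unrepaired pixels follows. It assumes grayscale key frames already warped into the current viewpoint, with a validity mask marking pixels not occluded by dynamic targets; the linearly increasing weights and the square averaging window controlled by k are illustrative choices standing in for formulas (4) and (5), not the exact patented expressions.

    import numpy as np

    def fuse_background(keyframes, valid_masks):
        """Time-weighted fusion of n aligned grayscale key frames (oldest first).

        keyframes:   list of HxW float arrays warped into the current view.
        valid_masks: list of HxW bool arrays, True where the pixel is observed
                     (i.e. not occluded by a dynamic target).
        """
        n = len(keyframes)
        weights = np.arange(1, n + 1, dtype=float)      # newer frames weigh more
        acc = np.zeros_like(keyframes[0], dtype=float)
        wsum = np.zeros_like(keyframes[0], dtype=float)
        for kf, m, w in zip(keyframes, valid_masks, weights):
            acc += w * kf * m
            wsum += w * m
        fused = np.where(wsum > 0, acc / np.maximum(wsum, 1e-9), 0.0)
        unrepaired = wsum == 0                          # never-observed background
        return fused, unrepaired

    def smooth_unrepaired(fused, unrepaired, k=2):
        """Fill remaining holes with the mean of repaired pixels in a window of
        half-size k (the smoothing threshold of formula (5), used illustratively)."""
        out = fused.copy()
        h, w = unrepaired.shape
        for i, j in zip(*np.nonzero(unrepaired)):
            i0, i1 = max(0, i - k), min(h, i + k + 1)
            j0, j1 = max(0, j - k), min(w, j + k + 1)
            patch = fused[i0:i1, j0:j1]
            known = ~unrepaired[i0:i1, j0:j1]
            if known.any():
                out[i, j] = patch[known].mean()
        return out

The fused and smoothed frame would then be used as the key frame that rebuilds the occluded portion of the map background.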
B. Multiple map construction threads:
the multi-map construction thread has the main functions of storing a plurality of maps established by the local construction thread and a local key frame database thereof, detecting whether a loop phenomenon exists between a current frame and a previously stored map or not, performing map fusion if a loop exists, and optimizing the camera pose at the same time, as shown in fig. 4. The specific workflow of the multi-map construction thread is as follows:
(1) When the algorithm starts to run, the tracking thread creates the first map M_0 and its local key frame database KFS_0 and passes them into the multi-map construction thread; the tracking thread, the local mapping thread and the loop detection thread then track and map on M_0. As long as the tracking thread is not lost, the multi-map construction thread remains idle.
(2) After the n-th tracking loss, the tracking thread creates a new map M_n and its key frame database KFS_n and passes them to the global map M and the global key frame database in the multi-map construction thread.
(3) The tracking thread attempts to reinitialize; once initialization succeeds, every thread is notified to transfer the tracking and mapping tasks to the new local map M_n. At this moment, the multi-map construction thread scans the key frames stored in the local key frame database KFS_n of the new map and matches them against the key frames stored in the previous global key frame database.
(4) The matching method is similar to the loop detection process: the multi-map construction thread traverses all previous local key frame databases (KFS_0 to KFS_(n-1)) and uses formula (6), which takes the minimum of the similarity scores between the current key frame K_c and all previously stored key frames, to query whether a previous map contains a key frame matching K_c.
(5) For each key frame K_j in a local map M_i ∈ [M_0, ..., M_(n-1)], if K_j has more than fifteen matching points with K_c, a solver computes the similarity transformation between them, or random sample consensus (RANSAC) iterations are performed for each K_j until a K_j with enough matching points is found or all candidate frames fail. If a similarity transformation can be returned after the RANSAC iterations, it is optimized; if enough matching points remain after optimization, K_j is regarded as a loop key frame. All map points seen in K_j and its adjacent frames are then detected and re-projected in K_c, and the computed similarity transformation is used to search for more matched frames. If the number of corresponding points over all matched frames exceeds a threshold, the frame is considered a loop.
(6) The multi-map construction thread then fuses the map in which the loop was detected with the current map, optimizes the camera pose, and finally generates a new global map M'. The map fusion method is as follows: the multi-map construction thread first calculates the similarity transformation matrix S_cw between K_c and K_j, and then uses S_cw to connect the local map M_n and the loop-generating map M_i through loop fusion (the same loop fusion as in loop detection). If the two maps are being fused for the first time, M_n is retrieved; otherwise only K_c is retrieved. The adjacent frames and map points of K_c are then corrected with S_cw, and the pose of each retrieved key frame is converted into coordinates in the coordinate system of map M_i by the following equations:

T_ic = T_iw * T_wc (7)

T_corr = T_ic * T_wc (8)

where T_iw is the pose of the retrieved key frame before correction, T_wc is the inverse pose of K_c before correction, and T_corr is the corrected pose of the retrieved key frame in the coordinate system of M_i.
In summary, the map points of each retrieved key frame and its neighbouring frames are corrected into coordinates in the coordinate system of map M_i, and the map points of K_j and its neighbouring frames are projected into K_c and its neighbouring frames, thus completing the fusion between the maps. The overall workflow is shown in detail in fig. 17. A brief code sketch of the pose correction and the surrounding multi-map bookkeeping is given below.
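A minimal sketch of the key-frame pose correction of equations (7) and (8) and of the surrounding multi-map bookkeeping follows. It assumes 4x4 homogeneous SE(3) pose matrices; the map and key-frame containers, the match counter, the sim(3) solver and the similarity score function are named placeholders rather than the actual data structures of the algorithm.

    import numpy as np

    def correct_keyframe_pose(T_iw, T_wc):
        """Equations (7) and (8): express a retrieved key-frame pose in the
        coordinate system of the loop map M_i.

        T_iw: pose of the retrieved key frame before correction.
        T_wc: inverse pose of K_c before correction.
        """
        T_ic = T_iw @ T_wc        # equation (7)
        T_corr = T_ic @ T_wc      # equation (8), as given in the description
        return T_corr

    class MultiMapManager:
        """Skeleton of the multi-map construction thread (illustrative only)."""

        def __init__(self):
            self.maps = []        # M_0 ... M_n, each with its key-frame database

        def on_tracking_lost(self, new_map):
            # The tracking thread creates a new local map M_n after each loss.
            self.maps.append(new_map)

        def min_similarity_score(self, K_c, keyframes, score):
            # Formula (6): minimum similarity between K_c and the stored key frames.
            return min(score(K_c, kf) for kf in keyframes)

        def try_fuse(self, K_c, min_matches=15):
            # Scan earlier maps for a loop key frame K_j matching K_c, then fuse.
            current_map = self.maps[-1]
            for M_i in self.maps[:-1]:
                for K_j in M_i.keyframes:                  # placeholder container
                    if M_i.count_matches(K_c, K_j) <= min_matches:
                        continue
                    S_cw = M_i.solve_sim3(K_c, K_j)        # RANSAC similarity transform
                    if S_cw is not None:
                        self.fuse(current_map, M_i, K_c, K_j, S_cw)
                        return True
            return False

        def fuse(self, M_n, M_i, K_c, K_j, S_cw):
            # Re-express every key frame of the new local map in M_i's coordinate
            # system via equations (7) and (8); the sim(3) correction of K_c and
            # the merging of map points are left to the placeholder absorb() call.
            T_wc = np.linalg.inv(K_c.pose)     # inverse pose of K_c before correction
            for kf in M_n.keyframes:
                kf.pose = correct_keyframe_pose(kf.pose, T_wc)
            M_i.absorb(M_n, K_c, K_j, S_cw)    # placeholder: merge the two maps

Everything on MultiMapManager other than the pose arithmetic is bookkeeping around containers that the real system would provide; only equations (7) and (8) are taken directly from the description above.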
In addition, in order to verify the effectiveness of the algorithm, it was tested on several collected typical video sequences, covering initialization and tracking speed, multi-map construction after tracking loss, and detection effectiveness in dynamic environments. Finally, the DE-SLAMM algorithm proposed by the invention was compared in performance with typical V-SLAM algorithms (the ORB-SLAM2 algorithm, the ORBSLAMM algorithm, and the DynaSLAM algorithm). The hardware environment of the experimental platform was an Intel Core i7-10750 CPU @ 2.6 GHz x 6 cores with 16 GB RAM, and the experimental results show that the algorithm runs in real time at the frame rate of each sequence.
A. Dynamic target detection and background repair performance testing:
in order to test the dynamic target detection performance and the background repair performance of the algorithm, experiments are carried out by utilizing fr3-sitting-xyz and fr3-walking-xyz video sequences in the collected data set, and the experiments are compared with an ORB-SLAM2 algorithm, an ORBSLAMM algorithm and a dynaSLAM algorithm.
1) Dynamic target detection performance testing
In the fr3-sitting-xyz video sequence the camera moves in the X, Y and Z directions, and the person in the sequence only moves on a chair with small motion amplitude, so the environment of the fr3-sitting-xyz video sequence is a weakly dynamic environment.
In this weakly dynamic environment, a person only sits on a chair with small motion amplitude while the camera moves in the X, Y and Z directions. The experimental results of the ORB-SLAM2 and ORBSLAMM algorithms on the fr3-sitting-xyz video sequence are shown in FIGS. 5 and 6: when these algorithms extract feature points for tracking and mapping, points on the dynamic objects (the person and the chair) are extracted and the dynamic objects are built into the map as part of it, so in a weakly dynamic environment the ORB-SLAM2 and ORBSLAMM algorithms cannot segment the dynamic objects in the environment. The experimental result of the DynaSLAM algorithm on the fr3-sitting-xyz video sequence is shown in fig. 7: DynaSLAM can detect and segment the prior dynamic objects (the person) in the video and does not use the feature points extracted on them for tracking and mapping, but it cannot detect non-prior dynamic targets (the moving chair) and sometimes misdetects static objects as dynamic ones, so in a weakly dynamic environment DynaSLAM cannot segment all dynamic targets in the environment. The algorithm of the invention uses the Mask R-CNN neural network framework to detect and segment the prior dynamic target (the person) in the environment, then uses multi-view geometry to detect and segment the chair while it is being moved, and does not extract feature points on the segmented dynamic targets (the person and the moving chair) for tracking and mapping, as shown in fig. 8. Therefore, in a weakly dynamic environment the algorithm can segment almost all dynamic targets, and its dynamic target detection performance is superior to that of the other three V-SLAM algorithms in this example.
2) Background repair performance testing
In the fr3-walking-xyz video sequence the camera moves in the X, Y and Z directions, and the person in the sequence walks around with large motion amplitude, so the environment of the fr3-walking-xyz video sequence is a strongly dynamic environment.
In this strongly dynamic environment, a person walks back and forth with large motion amplitude while the camera moves in the X, Y and Z directions. The experimental results of the ORB-SLAM2 and ORBSLAMM algorithms on the fr3-walking-xyz video sequence are shown in FIGS. 9 and 10: fewer feature points are extracted on the dynamic objects (the person and the chair), yet the extracted dynamic-object feature points are still used for tracking and mapping, so in a strongly dynamic environment the ORB-SLAM2 algorithm cannot segment the dynamic targets in the environment and cannot establish a true map of the environment. The experimental result of the DynaSLAM algorithm on the fr3-walking-xyz video sequence is shown in fig. 11: it can detect and segment the prior dynamic target (the person) and does not track or map the feature points extracted on it, but it cannot detect non-prior dynamic targets (the moving chair) and sometimes misdetects static objects as dynamic ones, so in a strongly dynamic environment DynaSLAM cannot build a true map of the environment. The algorithm of the invention detects and segments the prior dynamic target (the person) with the Mask R-CNN neural network framework, detects and segments the chair while it is being moved with multi-view geometry, and does not track or map the feature points extracted on the segmented dynamic targets (the person and the moving chair), as shown in fig. 12. In addition, the algorithm of the invention repairs the background occluded by dynamic targets in the key frames with the multi-frame fusion background repair algorithm, so the sparse map it establishes clearly contains more map points than the sparse map established in fig. 11. In conclusion, in a strongly dynamic environment the algorithm can segment almost all dynamic targets in the environment and can also repair the background map occluded by the dynamic targets, so it is superior to the other three V-SLAM algorithms.
B. Multi-map build performance testing
In the test of the multi-map construction performance of DE-SLAMM, in order to better show that the DE-SLAMM algorithm can continue tracking and mapping after tracking is lost, experiments were carried out on the typical tracking-loss test video sequence fr1-floor, and performance was again compared with the three algorithms above.
The fr1-floor video sequence is captured by a camera moving quickly in a room; this experiment mainly tests whether the algorithm can continue tracking and mapping when a tracking problem is encountered.
The experimental result of the ORB-SLAM2 algorithm is shown in fig. 13, where the blue rectangular boxes represent the key frames of the tracking process and the red and black points represent the established map points. After initialization, ORB-SLAM2 extracts feature points in the environment and starts tracking and mapping; when tracking is lost, it enters relocalization mode, but it cannot return to a previously visited position to relocalize, so subsequent tracking and mapping cannot continue.
As shown in fig. 14, when the DynaSLAM algorithm performs tracking and mapping, a static object in the environment is erroneously detected as a dynamic object, which causes tracking loss; after the loss the algorithm enters relocalization mode and cannot continue tracking and mapping.
Fig. 15 shows the experimental result of the ORBSLAMM algorithm: after initialization it extracts feature points in the environment and starts tracking and mapping; after map 1 is established the algorithm loses tracking, then reinitializes, extracts feature points in the environment again to continue tracking and mapping, and establishes map 2.
The experimental result of the DE-SLAMM algorithm of the invention is shown in fig. 16. After initialization the algorithm establishes a local map M_1; tracking loss occasionally occurs, but the algorithm quickly reinitializes, establishes a new local map M_2 and continues tracking and mapping, and when a loop frame is detected the new local map M_2 and map M_1 are fused. The algorithm can therefore continue tracking and mapping when tracking loss occurs, which is superior to the ORB-SLAM2 and DynaSLAM algorithms, and it can also fuse the two established local maps into a complete global map, so its performance in the presence of tracking loss is superior to that of the first three algorithms.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Claims (7)

1. A positioning and map construction method based on monocular vision is characterized by comprising the following steps:
(1) processing the video frames used for map creation with a Mask R-CNN neural network and segmenting the prior dynamic targets in the environment to obtain segmented image frames;
(2) positioning the segmented image frames in the map using a low-cost tracking module;
(3) tracking, detecting and positioning the segmented image frames processed in the preceding steps with a multi-view geometry module;
(4) performing background restoration based on time-weighted filtering on the portions of the map background occluded by dynamic targets;
(5) acquiring the maps established in steps (1) to (4); when tracking failure occurs, adaptively generating a new local map and, when a loop closure is detected, fusing it with the previously established map, thereby realizing a multi-map construction thread.
2. The monocular vision based positioning and mapping method according to claim 1, wherein in the step (3), the method using the multi-view geometry module comprises:
s11, firstly, calculating the depth of each map point projected from the previous frame to the pixel point of the current frame;
s12, calculating the parallax angle of the back projection of the pixel point of the previous frame and the current frame, comparing the viewing angle with a set threshold, if the viewing angle is larger than the threshold, determining the pixel point as a dynamic point, and meanwhile, judging by adopting the reprojection error of the pixel point of the previous frame and the current frame, if the error is larger than the set threshold, determining the pixel point as a dynamic point;
and S13, removing all detected dynamic points to generate a new segmented image frame for tracking detection and positioning.
3. The monocular vision based positioning and mapping method of claim 2, wherein the method of detecting the dynamic point comprises:
calculating, by triangulation, the depth z_proj at which the pixel point x of the previous frame projects to the pixel point x' of the current frame according to formula (1), in which the antisymmetric matrix of x' is used together with the relative pose between the two frames;
calculating the parallax angle α between the back-projections of x and x' according to formula (2), as the angle between the back-projection vectors of x and x';
judging whether the angle α is larger than a set threshold β; if so, the key point is considered occluded and is ignored;
wherein, for the case in which a static point exceeds β, the following limiting condition is added: first the depth z' of the current-frame pixel point x' is obtained, then the reprojected depth z_proj is calculated, and finally the difference Δz between z_proj and z' is evaluated,

Δz = z_proj - z' (3)

where z_proj is the reprojected depth and z' is the depth of the current-frame pixel point;
if Δz exceeds the threshold τ_z, the pixel point x' is considered to belong to a dynamic point; otherwise it is considered static.
4. The monocular vision based localization and mapping method according to claim 1, wherein in the step (4), the method for background restoration based on temporal weighted filtering comprises:
backtracking the key frame images of the previous n moments to repair the background, with the fusion weights of the key frames differing with time such that the weight of a key frame closer to the current frame is larger, so that the background repair is performed as follows:
according to formula (4), the key frame used for background repair is synthesized as the time-weighted sum of the n backtracked key frames, where KFS_i denotes the key frame at time t_i and KFS_c denotes the current key frame;
the above formula thus uses n key frames to synthesize one key frame for background repair.
5. The monocular vision-based localization and mapping method of claim 4, wherein when there are a few vacant parts of the synthesized key frame that are not repaired, the method of computing comprises:
when the pixels in rows i1 to i2 and columns j1 to j2 are not repaired, setting a smoothing threshold k and applying smoothing filtering according to formula (5), in which each unrepaired pixel value u_{i,j} (row i, column j of the image before restoration) is replaced by the smoothed value u'_{i,j} (row i, column j of the image after restoration) computed from the neighbouring repaired pixels.
6. The monocular vision based positioning and mapping method of claim 1, wherein in the step (5), the method of multiple mapping threads comprises:
s21, creating a first map M0And its local key frame database KFS0And the data is transmitted into a multi-map construction thread, and then a tracking thread, a local map construction thread and a loop detection thread are used for map M0Tracking and constructing a map, wherein the multi-map construction thread is in an idle state on the premise that the tracking thread is not lost;
s22, after the nth tracking thread is lost, the tracking thread creates a new map MnAnd key frame database KFSnAnd transmitting the data to a global map M and a global key frame database in a multi-map construction thread;
s23, the trace thread attempts to reinitialize, and after the initialization succeeds,the threads will be notified to transfer the task of tracking and mapping to the local map MnIn operation, at the moment, the multi-map construction thread scans the local key frame database KFS of the new mapnThe key frames stored in the database are matched with the key frames stored in the previous global key frame database;
s24, traversing all previous local key frame databases KFS0~KFSn-1And calculating the minimum similarity score between all the previous key frames and the current key frame by using the following formula (6) to inquire whether the current key frame K is similar to the previous key frame K in the previous mapcMatching key frames;
Figure FDA0003089482090000031
s25, for each Mi∈[M0~Mn-1]Keyframe K in a local mapjIf it is combined with KcHaving more than fifteen matching points, let the solver calculate the similarity transformation between them, or for each KjPerforming a random sampling consistency iteration until K with enough matching points is foundjOr all candidate frames fail; if a similarity transformation can be returned after the random sampling consistency iteration, the similarity transformation can be optimized, and if enough matching points still exist after the optimization, K isjIs considered to be a loop key frame, and all are at KjAnd the map points seen in its adjacent frames will all be at KcDetecting and reprojecting, searching more matched frames by using the calculated similarity transformation, and if the point number corresponding to all the matched frames exceeds a threshold value, considering the matched frames as a loop;
and S26, carrying out map fusion on the map detected to be looped back and the current map, optimizing the camera pose, and finally generating a new global map M'.
7. The monocular vision based positioning and mapping method of claim 6, wherein the map fusion method comprises:
first calculating the similarity transformation matrix S_cw between K_c and K_j, and then using S_cw to connect the local map M_n and the loop-generating map M_i through loop fusion; if the two maps are fused for the first time, M_n is retrieved, otherwise only K_c is retrieved; then the adjacent frames and map points of K_c are corrected with S_cw, and the pose of each retrieved key frame is converted into coordinates in the coordinate system of map M_i by the following equations:

T_ic = T_iw * T_wc (7)

T_corr = T_ic * T_wc (8)

where T_iw is the pose of the retrieved key frame before correction, T_wc is the inverse pose of K_c before correction, and T_corr is the corrected pose of the retrieved key frame in the coordinate system of M_i.
CN202110591607.1A 2021-05-28 2021-05-28 Positioning and map construction method based on monocular vision Active CN113298904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110591607.1A CN113298904B (en) 2021-05-28 2021-05-28 Positioning and map construction method based on monocular vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110591607.1A CN113298904B (en) 2021-05-28 2021-05-28 Positioning and map construction method based on monocular vision

Publications (2)

Publication Number Publication Date
CN113298904A true CN113298904A (en) 2021-08-24
CN113298904B CN113298904B (en) 2022-12-02

Family

ID=77325815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110591607.1A Active CN113298904B (en) 2021-05-28 2021-05-28 Positioning and map construction method based on monocular vision

Country Status (1)

Country Link
CN (1) CN113298904B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463909A (en) * 2014-11-28 2015-03-25 北京交通大学长三角研究院 Visual target tracking method based on credibility combination map model
CN109387204A (en) * 2018-09-26 2019-02-26 东北大学 The synchronous positioning of the mobile robot of dynamic environment and patterning process in faced chamber
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method
US10813715B1 (en) * 2019-10-16 2020-10-27 Nettelo Incorporated Single image mobile device human body scanning and 3D model creation and analysis
WO2021082771A1 (en) * 2019-10-29 2021-05-06 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Augmented reality 3d reconstruction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463909A (en) * 2014-11-28 2015-03-25 北京交通大学长三角研究院 Visual target tracking method based on credibility combination map model
CN109387204A (en) * 2018-09-26 2019-02-26 东北大学 The synchronous positioning of the mobile robot of dynamic environment and patterning process in faced chamber
US10813715B1 (en) * 2019-10-16 2020-10-27 Nettelo Incorporated Single image mobile device human body scanning and 3D model creation and analysis
WO2021082771A1 (en) * 2019-10-29 2021-05-06 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Augmented reality 3d reconstruction
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
齐咏生, 孙作慧, 李永亭, 刘利强: "Research on simultaneous localization and mapping of mobile robots based on ISRCDKF" (基于ISRCDKF的移动机器人同时定位与建图研究), 《农业机械学报》 (Transactions of the Chinese Society for Agricultural Machinery), 5 September 2019 (2019-09-05) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115526811A (en) * 2022-11-28 2022-12-27 电子科技大学中山学院 Adaptive vision SLAM method suitable for variable illumination environment

Also Published As

Publication number Publication date
CN113298904B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN108960211B (en) Multi-target human body posture detection method and system
CN107808111B (en) Method and apparatus for pedestrian detection and attitude estimation
CN111563442A (en) Slam method and system for fusing point cloud and camera image data based on laser radar
Boniardi et al. Robot localization in floor plans using a room layout edge extraction network
CN106952288B (en) Based on convolution feature and global search detect it is long when block robust tracking method
US6826292B1 (en) Method and apparatus for tracking moving objects in a sequence of two-dimensional images using a dynamic layered representation
Luo et al. Real-time dense monocular SLAM with online adapted depth prediction network
WO2017206005A1 (en) System for recognizing postures of multiple people employing optical flow detection and body part model
Concha et al. RGBDTAM: A cost-effective and accurate RGB-D tracking and mapping system
Sock et al. Multi-task deep networks for depth-based 6d object pose and joint registration in crowd scenarios
CN111739144A (en) Method and device for simultaneously positioning and mapping based on depth feature optical flow
CN111354022A (en) Target tracking method and system based on kernel correlation filtering
Lamarca et al. Camera tracking for slam in deformable maps
CN113298904B (en) Positioning and map construction method based on monocular vision
Wang et al. Effective multiple pedestrian tracking system in video surveillance with monocular stationary camera
Hu et al. Multiple maps for the feature-based monocular SLAM system
Dai et al. RGB‐D SLAM with moving object tracking in dynamic environments
Amer Voting-based simultaneous tracking of multiple video objects
Welponer et al. Monocular depth prediction in photogrammetric applications
CN115131407B (en) Robot target tracking method, device and equipment oriented to digital simulation environment
CN113570713B (en) Semantic map construction method and device for dynamic environment
Suttasupa et al. Plane detection for Kinect image sequences
Samdurkar et al. Overview of Object Detection and Tracking based on Block Matching Techniques.
CN108534797A (en) A kind of real-time high-precision visual odometry method
Wang et al. Stream query denoising for vectorized hd map construction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant