CN111724439A

CN111724439A - Visual positioning method and device in dynamic scene

Info

Publication number: CN111724439A
Application number: CN201911200881.0A
Authority: CN
Inventors: 姜昊辰; 张晓林; 李嘉茂; 刘衍青; 朱冬晨; 彭镜铨
Original assignee: Shanghai Institute of Microsystem and Information Technology of CAS
Current assignee: Shanghai Institute of Microsystem and Information Technology of CAS
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-09-29
Anticipated expiration: 2039-11-29
Also published as: CN111724439B

Abstract

The invention relates to the technical field of robot navigation positioning, in particular to a visual positioning method and a visual positioning device in a dynamic scene, wherein the method comprises the following steps: acquiring a current frame image, and extracting feature points of the current frame image; inputting the current frame image into a preset deep learning network for semantic segmentation to obtain a target semantic image; determining a motion mask area of the current frame image according to the target semantic image; acquiring depth information of the current frame image; performing motion consistency detection based on the target semantic image and the depth information, and determining a static feature point set of the current frame image; and determining the current state pose information according to the static feature point set. The method and the device perform motion consistency detection through semantic segmentation results and depth information, determine the static characteristic point set of the image, and can effectively improve the accuracy and robustness of pose estimation in a dynamic environment.

Description

Visual positioning method and device in dynamic scene

Technical Field

The invention relates to the technical field of robot navigation positioning, in particular to a visual positioning method and device in a dynamic scene.

Background

With the development of artificial intelligence technology, more and more intelligent mobile robots appear in various scenes in production and life. From an industrial robot to a housekeeping service robot, and from an unmanned aerial vehicle to an underwater detection robot, an important condition of intellectualization is that the robot can move autonomously, namely, the autonomous navigation of the robot is realized. In order to realize autonomous movement in various environments, two basic problems need to be solved, namely positioning and Mapping, and the core of the intelligent mobile robot is a Simultaneous positioning and Mapping (SLAM) technology.

SLAM technology can be largely classified into laser SLAM and visual SLAM depending on the type of sensor. Visual SLAM technology has been extensively studied in recent years due to the richness of images in information storage, and the service type of images for some higher level of work (such as semantic segmentation and object detection). The existing visual SLAM technology is usually a complete framework, comprises parts such as feature extraction, loopback detection and the like, and has obtained better test results under certain environments. However, the existing visual SLAM technology based on point features is based on static environment assumption, and for most actual scenes, an absolute static scene does not exist, so that the accuracy of pose estimation is sharply reduced or even the technology cannot work when the technology is located in a dynamic environment. Meanwhile, since the moving object is not judged, the appearance of artifacts can be caused when the dense point cloud map is reconstructed, so that the environment is wrongly perceived.

Disclosure of Invention

In view of the foregoing problems in the prior art, an object of the present invention is to provide a visual positioning method and apparatus in a dynamic scene, which can improve accuracy and robustness of pose estimation in a dynamic environment.

In order to solve the above problem, the present invention provides a visual positioning method in a dynamic scene, including:

acquiring a current frame image, and extracting feature points of the current frame image;

inputting the current frame image into a preset deep learning network for semantic segmentation to obtain a target semantic image;

determining a motion mask area of the current frame image according to the target semantic image;

acquiring depth information of the current frame image;

performing motion consistency detection based on the target semantic image and the depth information, and determining a static feature point set of the current frame image;

and determining the current state pose information according to the static feature point set.

Further, the performing motion consistency detection based on the target semantic image and the depth information, and determining the set of static feature points of the current frame image includes:

determining a background area of the current frame image according to the target semantic image;

determining first attitude information according to the characteristic points of the background area of the current frame image;

determining types of feature points of the motion mask region based on the first pose information and the depth information, the types including dynamic feature points and static feature points;

removing dynamic characteristic points in the characteristic points of the motion mask area and reserving static characteristic points;

and generating a static characteristic point set according to the static characteristic points and the characteristic points of the background area.

Further, the determining the type of feature points of the motion mask region based on the first pose information and the depth information comprises:

calculating a motion score of a feature point of the motion mask region according to the first pose information and the depth information;

when the motion score is smaller than a preset threshold value, judging that the feature point is a static feature point;

and when the motion score is greater than or equal to a preset threshold value, judging the characteristic point as a dynamic characteristic point.

Specifically, the calculating a motion score of a feature point of the motion mask region according to the first pose information and the depth information includes:

acquiring feature points of a motion mask area of a first reference frame image, and matching the feature points of the motion mask area of the first reference frame image with the feature points of the motion mask area of the current frame image to obtain matching point pairs; wherein, the first reference frame image is a previous frame image of the current frame image;

screening the matching point pairs, and removing the matching point pairs which are mismatched;

and calculating the distance between the screened matching point pairs according to the first position information and the depth information, and taking the distance as the motion score of the characteristic points of the motion mask area.

Further, after determining the motion mask region of the current frame image according to the target semantic image, the method further includes:

acquiring a first motion mask area of a first reference frame image and a second motion mask area of a second reference frame image; the first reference frame image is a frame image before the current frame image, and the second reference frame image is a frame image before the first reference frame image;

judging whether the current frame image has missing detection according to the first motion mask area and the motion mask area of the current frame image;

if the current frame image has missing detection, determining a first target motion mask area according to the first motion mask area and the second motion mask area;

and replacing the motion mask area of the current frame image with the first target motion mask area.

Preferably, the method further comprises:

if the current frame image has no missing detection, determining a second target motion mask area according to the first motion mask area and the motion mask area of the current frame image;

and replacing the motion mask area of the current frame image with the second target motion mask area.

Further, after the obtaining the depth information of the current frame image, the method further includes:

and repairing the depth information by using a preset morphological method.

Further, the determining the current state pose information according to the set of static feature points includes:

when a new key frame is generated, establishing data association among the feature points in the static feature point set, the key frame and the map point;

determining second attitude information according to the feature points in the static feature point set;

and performing pose optimization according to the second pose information and the data association to determine the pose information of the current state.

Further, the method further comprises:

generating static object point cloud data based on the target semantic image, the current state pose information and the static feature point set;

and performing dense reconstruction of the static object point cloud map according to the static object point cloud data.

Another aspect of the present invention provides a visual positioning apparatus in a dynamic scene, including:

the first acquisition module is used for acquiring a current frame image and extracting the characteristic points of the current frame image;

the semantic segmentation module is used for inputting the current frame image into a preset deep learning network for semantic segmentation to obtain a target semantic image;

the determining module is used for determining a motion mask area of the current frame image according to the target semantic image;

the second acquisition module is used for acquiring the depth information of the current frame image;

the detection module is used for carrying out motion consistency detection based on the target semantic image and the depth information and determining a static characteristic point set of the current frame image;

and the positioning module is used for determining the current state pose information according to the static characteristic point set.

Due to the technical scheme, the invention has the following beneficial effects:

according to the visual positioning method under the dynamic scene, the image is subjected to semantic segmentation, the motion consistency is detected according to the semantic segmentation result and the depth information, the dynamic feature points of a motion mask area are removed, the static feature points in the motion mask are added into a static feature point set, the pose optimization is realized by utilizing the static feature point set, and a dense static object point cloud map is reconstructed. The accuracy and robustness of pose estimation in a dynamic environment can be improved, and the accuracy of the static object point cloud map is improved, so that the accuracy of environment perception is improved.

In addition, the visual positioning method under the dynamic scene is based on the assumption of motion continuity, the semantic segmentation results of the adjacent frame images are fused, the condition of missing segmentation is compensated, and the accuracy of the method can be further improved.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description of the embodiment or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a schematic diagram of a visual positioning system in a dynamic scene according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for visual localization in a dynamic scene according to an embodiment of the present invention;

FIG. 3 is a flow chart of a visual positioning method in a dynamic scene according to another embodiment of the present invention;

FIG. 4 is a flow chart of a visual positioning method in a dynamic scene according to another embodiment of the present invention;

FIG. 5 is a flow chart of a visual positioning method in a dynamic scene according to another embodiment of the present invention;

FIG. 6A is a diagram illustrating the test results of the ORB-SLAM2 system according to one embodiment of the present invention;

FIG. 6B is a diagram illustrating test results of a DS-SLAM system according to an embodiment of the present invention;

FIG. 6C is a diagram illustrating test results of a visual positioning method in a dynamic scene according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of a visual positioning apparatus in a dynamic scene according to another embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

In order to make the objects, technical solutions and advantages disclosed in the embodiments of the present invention more clearly apparent, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the embodiments of the invention and are not intended to limit the embodiments of the invention. First, the embodiments of the present invention explain the following concepts:

ORB-SLAM2 System: an open source SLAM system for monocular, binocular and RGBD cameras. The ORB-SLAM system is a real-time monocular SLAM system based on feature points and can operate in large-scale, small-scale, indoor and outdoor environments. The system is also robust to strenuous exercise, supporting wide baseline closed loop detection and relocation, including full automatic initialization. The ORB-SLAM2 system also supports calibrated binocular cameras and RGBD cameras on the basis of the ORB-SLAM system.

RGBD: red Green Blue Deep, three-channel color picture and depth information.

DS-SLAM System: a semantic visual SLAM system for dynamic scenes. The DS-SLAM system is proposed based on the ORB-SLAM2 system, and combines a semantic segmentation network with a mobile consistency check method, so that the influence of dynamic objects is reduced, and the positioning precision is greatly improved in a dynamic environment.

Referring to the specification and fig. 1, the embodiment provides a visual positioning system in a dynamic scene, which may include a semantic segmentation and compensation thread 110, a tracking estimation thread 120, a local map thread 130, a loop detection thread 140, and a dense point cloud mapping thread 150.

The semantic segmentation and compensation thread 110 outputs pixel-by-pixel semantic classification results for the input picture by a neural network method such as semantic segmentation, and based on a motion continuity assumption, semantic segmentation results for adjacent frames are fused to compensate for missing segmentation.

The tracking estimation thread 120 calculates the relative motion relationship of the static background by extracting the feature points on the current frame image, collecting semantic information, then calculates the feature point matching relationship of the motion mask region, removes the feature points which do not satisfy the matching relationship, judges the dynamic and static conditions of the feature points by combining the relative motion relationship in a motion consistency detection and depth map weighting mode, eliminates the dynamic feature points, and updates the static feature point set.

The local map thread 130 performs local pose adjustment based on a common view by applying a local sliding window manner to a plurality of key frames and feature points, and is a secondary optimization process for the pose on the basis of the tracking estimation thread 120.

The loop detection thread 140 compares the condition of each frame with all the key frames, and performs global optimization and screening on all the key frames once when finding similar conditions.

The dense point cloud mapping thread 150 removes the dynamic object region through the operations of the last threads, and uses methods such as ICP to splice point clouds according to the estimated relative pose with stable robustness, so as to obtain a dense point cloud map of a static object.

Referring to the specification and fig. 2, the present embodiment provides a visual positioning method in a dynamic scene, which may include the following steps:

s210: and acquiring a current frame image, and extracting the characteristic points of the current frame image.

In the embodiment of the invention, the characteristic points of the current frame image can be extracted by a tracking estimation thread by using an ORB characteristic extraction method, and a characteristic description method of an ORB descriptor is adopted. In some possible embodiments, other methods of characterizing features may be used, as the invention is not limited in this respect.

S220: and inputting the current frame image into a preset deep learning network for semantic segmentation to obtain a target semantic image.

In the embodiment of the invention, the semantic segmentation can be carried out on the current frame image through a semantic segmentation and compensation thread. The preset deep learning network can comprise an ENet semantic segmentation network, wherein the ENet semantic segmentation network is a more common segmentation network, has a simple network structure, has quick running time and few variables, and can be applied to real-time image segmentation and mobile terminal equipment.

In practical application, firstly, the identification mapping processing can be carried out according to a three-channel color image sequence to form a training set corresponding to a training label; an asymmetric network structure is adopted, so that the decoding end can conveniently perform up-sampling fine adjustment on the encoding end; replacing a linear rectification function (relu) layer in the network with a parameter correction linear unit (PReLUs), and adding additional parameters of a characteristic diagram; replacing the convolution layer in the bottleneck structure (bottleeck) with a cavity convolution and connecting in series to increase the receptive field; a spatial random discard (Dropout) process is used to prevent overfitting.

In some possible embodiments, other semantic segmentation networks may be used, which is not limited by the present invention.

S230: and determining a motion mask area of the current frame image according to the target semantic image.

In the embodiment of the invention, the motion mask area can be determined through a semantic segmentation and compensation thread. The motion mask region may include a region of a potentially moving object, which may include a person, an animal, etc.

In one possible embodiment, as shown in fig. 3, after determining the motion mask region of the current frame image according to the target semantic image, the method may further include:

s310: acquiring a first motion mask area of a first reference frame image and a second motion mask area of a second reference frame image; the first reference frame image is a frame image before the current frame image, and the second reference frame image is a frame image before the first reference frame image.

In the embodiment of the present invention, the first motion mask region of the first reference frame image and the second motion mask region of the second reference frame image may be determined by performing semantic segmentation through a preset deep learning network.

S320: and judging whether the current frame image has missing detection according to the first motion mask area and the motion mask area of the current frame image.

In the embodiment of the invention, whether the current frame image is subjected to omission or not can be determined according to the intersection ratio by calculating the intersection ratio of the motion mask of the current frame image and the motion mask of the first reference frame image.

S330: and if the current frame image has omission, determining a first target motion mask area according to the first motion mask area and the second motion mask area.

In the embodiment of the present invention, if the current frame image has missing detection, the first motion mask region of the first reference frame image and the second motion mask region of the second reference frame image may be projected pixel by pixel into the current frame to obtain the first target motion mask region.

In particular, assume that

Is a motion mask for the first reference frame image,

motion mask for the second reference frame image, let us assume

For the motion mask of the current frame image, the intersection ratio of the motion mask of the current frame image and the motion mask of the first reference frame image is counted as D_iouThen D is_iouCan be defined as:

suppose that_iouIs D_iouIf D is calculated_iouValue less than_iouAnd indicating that the semantic meaning of the current frame image has omission. At this time, the

And

the motion mask is projected into the current frame pixel by pixel, and a new semantic detection result is calculated and recorded as S_iouThen S is_iouCan be defined as:

the intersection is taken because the subsequent motion consistency detection system can put the feature points meeting the condition back into the static point set, so that the expansion does not influence the precision of the system.

S340: and replacing the motion mask area of the current frame image with the first target motion mask area.

In the embodiment of the invention, after the motion mask area of the current frame image is determined, whether the current frame image is missed to be segmented or not can be detected through a semantic segmentation and compensation thread, and when the current frame image is judged to be missed to be detected, the semantic segmentation results of the first two frames of the current frame image are fused for compensation.

In another possible embodiment, as shown in fig. 3, the method may further include:

s350: and if the current frame image has no missing detection, determining a second target motion mask area according to the first motion mask area and the motion mask area of the current frame image.

In the embodiment of the present invention, if the current frame image does not have missing detection, the first motion mask region of the first reference frame image may be projected into the current frame pixel by pixel, and combined with the motion mask region of the current frame image to obtain the second target motion mask region.

S360: and replacing the motion mask area of the current frame image with the second target motion mask area.

In the embodiment of the invention, when the omission is judged not to occur, the semantic segmentation result of the previous frame image of the current frame image can be used for compensation, so that the possibility of omission is further reduced.

S240: and acquiring the depth information of the current frame image.

In a possible embodiment, after obtaining the depth information of the current frame image, the method may further include:

and repairing the depth information by using a preset morphological method.

In the embodiment of the invention, the depth information of the current frame image can be restored by adopting methods such as expansion operation, interpolation and the like through a tracking estimation thread.

S250: and performing motion consistency detection based on the target semantic image and the depth information, and determining a static characteristic point set of the current frame image.

In the embodiment of the invention, the motion consistency detection can be carried out through the tracking estimation thread, the dynamic characteristic points of the moving object in the current frame image are removed, and the static characteristic point set of the current frame image is obtained.

In one possible embodiment, as shown in fig. 4, the performing motion consistency detection based on the target semantic image and the depth information, and determining the set of static feature points of the current frame image may include:

s410: and determining the background area of the current frame image according to the target semantic image.

In the embodiment of the present invention, the background area may include a static object area, and the static object may include a road, a tree, a building, and the like.

S420: and determining first position information according to the characteristic points of the background area of the current frame image.

In this embodiment of the present invention, the determining the first pose information according to the feature point of the background region of the current frame image may include: acquiring feature points of a background area of a first reference frame image, and matching the feature points of the background area of the first reference frame image with the feature points of the background area of the current frame image to obtain matching point pairs; wherein, the first reference frame image is a previous frame image of the current frame image; screening the matching point pairs through a distance constraint and random sample consensus (RANSAC) algorithm to remove mismatching matching point pairs; and calculating the first position information by using the screened matching point pairs through a normalization eight-point method.

Specifically, assume that the set of feature points of the background region of the first reference frame image is B_i-1The feature point set of the background region of the current frame image is B_iSet B to_i-1And set B_iThe feature points in the description are matched to obtain a matching point pair set, the distance between descriptors is calculated, and the RANSAC algorithm is adopted to match the matching pointsAnd carrying out secondary screening on the matching point pairs in the matching point pair set to obtain a stable point set corresponding relation. The calculation method of the first posture information may include:

feature point set B from background region_iAnd B_i-1Determining a matching point pair set, selecting 8 matching point pairs from the matching point pair set, and estimating a basic matrix F by using a normalization eight-point method_i；

Calculating the distances d from the rest of the matched point pairs in the matched point pair set to the corresponding epipolar lines thereof_nIf d is_n<d, the point is an inner point, otherwise, the point is an outer point, and the number of the inner points which meet the condition is recorded as m_iWherein d is a preset distance threshold;

iterate S times, or obtain the number m of inner points_iThe total proportion of the set is greater than or equal to a preset proportion (for example, 95%), and the iteration is stopped. Selection of m_iMaximum basis matrix F_iAs the first pose.

S430: determining types of feature points of the motion mask region based on the first pose information and the depth information, the types including dynamic feature points and static feature points.

In one possible embodiment, as shown in fig. 5, the determining the type of the feature point of the motion mask region based on the first pose information and the depth information may include:

s431: and calculating the motion score of the characteristic point of the motion mask region according to the first position and posture information and the depth information.

S432: and when the motion score is smaller than a preset threshold value, judging that the characteristic point is a static characteristic point.

S433: and when the motion score is greater than or equal to a preset threshold value, judging the characteristic point as a dynamic characteristic point.

In another possible embodiment, the calculating the motion score of the feature point of the motion mask region according to the first pose information and the depth information may include:

In the embodiment of the invention, the motion score of the feature point of the motion mask region can be calculated by using a constraint mode comprising epipolar constraint, depth constraint and the like. In some possible embodiments, other constraint manners may be adopted, and the present invention is not limited to this.

Specifically, after the matching point pair is obtained, the matching point pair may be screened by using a distance constraint and a RANSAC algorithm, so as to obtain a screened matching point pair.

Let p be₁And p₂A set of matching pairs of points, p, being characteristic points of a motion mask₁Characteristic points, p, of motion mask regions of said first reference frame image₂For the feature points of the motion mask region of the current frame image, d₁And d₂Is p₁And p₂Corresponding depth values on the depth map. Deriving p from multi-view geometry₁And p₂The epipolar constraint should be satisfied, whose expression is as follows:

and F is a basic matrix obtained by calculating the characteristic points of the background area. Since the actual value of the object is not 0 due to the motion of the object, the error distance is obtained. Meanwhile, the current depth value can be obtained by carrying out perspective transformation and pose transformation on the first reference frame image feature point and carrying out reprojection, and the difference of the depth values is taken as a distance error, wherein the specific formula is as follows:

wherein D represents the distance and is a corresponding threshold, and if the calculated D value is greater than or equal to the threshold, the characteristic point p is judged₂Is a dynamic feature point, if the D value is less than the D value, the feature point p is judged₂Are static feature points.

S440: and eliminating dynamic characteristic points in the characteristic points of the motion mask area, and reserving static characteristic points.

In this embodiment of the present invention, the feature points of the motion mask region may be counted as a set, and when the feature points are determined to be dynamic feature points, the feature points are deleted from the set, and when the feature points are determined to be static feature points, the feature points are retained.

S450: and generating a static characteristic point set according to the static characteristic points and the characteristic points of the background area.

In the embodiment of the invention, the feature points of all the motion mask regions can be traversed, and the static feature point set is generated according to the reserved static feature points and the feature points of the background region.

S260: and determining the current state pose information according to the static feature point set.

In one possible embodiment, the determining current state pose information from the set of static feature points may include:

In the embodiment of the invention, the data association among the feature points, the key frames and the map points in the static feature point set can be established through the tracking estimation thread; and performing pose optimization through a local map thread to determine the pose information of the current state. Specifically, a basis matrix may be calculated by using a normalized eight-point method according to a matching point pair composed of the feature point in the static feature point set and the feature point in the first reference frame image matched with the feature point, so as to obtain the second pose information. And performing pose optimization by using the second pose information as an initial value and utilizing a local bundling technology to determine the pose information of the current state.

In one possible embodiment, the method may further include:

In the embodiment of the invention, dense reconstruction of the static object point cloud map can be carried out through the dense point cloud map building process.

In a possible embodiment, local optimization and key frame screening can be performed on all key frames through a local map thread, including corresponding addition and deletion of key frames and map points, so that estimation of the system can be more accurate and stable, and the semantic segmentation and compensation thread and the local map thread can perform information interaction to perform optimization of the rear-end pose together. The optimized key frame can be globally adjusted and loop detected through a loop detection thread. Pose estimation and dynamic area judgment can be carried out through a dense point cloud mapping process, a static background part is generated into local point cloud, and integral splicing and updating are completed; the tracking estimation thread, the semantic segmentation and compensation thread and the dense point cloud mapping thread can carry out information interaction, the pixels of the detected part of the moving object are removed according to the semantic segmentation result, the accurate pose obtained by the pose optimization is combined, and dense reconstruction of the point cloud map of the static object is realized through an ICP method.

In a specific embodiment, the practical effects of the embodiment of the invention are verified through the internationally open source data set. The data set is an RGBD indoor data set open at the Technical University of Munich (TUM), 8 high-speed cameras (100hz) are adopted for shooting, and a real motion track of the data set is provided through a motion capture system. The data set comprises the original image and the corresponding depth map, and provides an alignment script between the original image and the depth map and an evaluation script used for evaluating the pose estimation accuracy of the SLAM system. Compared with the ORB-SLAM2 system and the DS-SLAM system, the invention has the comparison indexes of Absolute Track Error (ATE) and Relative Position Error (RPE). The test results are shown in fig. 6A to 6C, wherein the left graph is the comparison result of the actual trajectory, including the true trajectory, the trajectory and the error line segment therebetween, and the right graph is the curve of the PRE variation with time, describing the relative stability of the system. The result shows that the method of the embodiment of the invention is superior to an ORB-SLAM2 system and a DS-SLAM system in effect, and the system can reach 22FPS due to the adoption of the ENet network, thereby completely meeting the requirement of real-time property.

In summary, the visual positioning method in a dynamic scene of the present invention has the following beneficial effects:

Referring to fig. 7 in the specification, the present embodiment provides a visual positioning apparatus 700 in a dynamic scene, where the apparatus 700 may include:

a first obtaining module 710, configured to obtain a current frame image and extract feature points of the current frame image;

a semantic segmentation module 720, configured to input the current frame image into a preset deep learning network for semantic segmentation to obtain a target semantic image;

a determining module 730, configured to determine a motion mask region of the current frame image according to the target semantic image;

a second obtaining module 740, configured to obtain depth information of the current frame image;

a detection module 750, configured to perform motion consistency detection based on the target semantic image and the depth information, and determine a static feature point set of the current frame image;

and a positioning module 760 for determining the current state pose information according to the static feature point set.

It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above.

The foregoing description has disclosed fully preferred embodiments of the present invention. It should be noted that those skilled in the art can make modifications to the embodiments of the present invention without departing from the scope of the appended claims. Accordingly, the scope of the appended claims is not to be limited to the specific embodiments described above.

Claims

1. A visual positioning method in a dynamic scene is characterized by comprising the following steps:

acquiring depth information of the current frame image;

2. The method of claim 1, wherein the performing motion consistency detection based on the target semantic image and the depth information, and wherein determining the set of static feature points of the current frame image comprises:

3. The method of claim 2, wherein the determining the type of feature points for the motion mask region based on the first pose information and the depth information comprises:

4. The method of claim 3, wherein the calculating motion scores for feature points of the motion mask region from the first pose information and the depth information comprises:

5. The method according to claim 1 or 2, wherein after determining the motion mask region of the current frame image according to the target semantic image, the method further comprises:

6. The method of claim 5, further comprising:

7. The method according to claim 1 or 2, wherein after obtaining the depth information of the current frame image, the method further comprises:

and repairing the depth information by using a preset morphological method.

8. The method according to claim 1 or 2, wherein the determining current state pose information from the set of static feature points comprises:

9. The method of claim 8, further comprising:

10. A visual positioning apparatus for dynamic scenes, comprising: