CN114332394A - Semantic information assistance-based dynamic scene three-dimensional reconstruction method - Google Patents

Semantic information assistance-based dynamic scene three-dimensional reconstruction method

Info

Publication number
CN114332394A
Authority
CN
China
Prior art keywords
semantic
point
point cloud
image
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111648384.4A
Other languages
Chinese (zh)
Inventor
Cui Linyan
Guo Zhenghang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202111648384.4A
Publication of CN114332394A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a semantic information assistance-based dynamic scene three-dimensional reconstruction method, which comprises the following steps: (1) a visual sensor collects images; semantic segmentation and point and line feature extraction are performed on the collected images, and static point and line features are screened out using the image semantic segmentation results; (2) the camera pose is calculated from the static point and line features obtained in step (1), and a static-object point cloud model of the dynamic scene is established; (3) combining the static-object point cloud model of the dynamic scene obtained in step (2) and fusing the image semantic segmentation results from step (1), the image semantic labels are mapped onto three-dimensional space points, and the semantic categories of all map points from step (2) are updated within a Bayesian filtering framework to obtain a semantic point cloud map; (4) point clouds of the same category in the semantic point cloud map from step (3) are clustered, and the pose of each object is determined with a minimum bounding box method, yielding a semantic point cloud model of the dynamic scene and solving the problem of semantic mapping in dynamic scenes.

Description

Semantic information assistance-based dynamic scene three-dimensional reconstruction method
Technical Field
The invention relates to a semantic information assistance-based dynamic scene three-dimensional reconstruction method, which is suitable for semantic map reconstruction in a dynamic scene.
Background
Real-time three-dimensional map reconstruction technology is widely applied in fields such as the military, reconnaissance, autonomous driving, and robotics, and has developed particularly rapidly in mobile robotics and autonomous driving in recent years. Constructing maps in real time and endowing them with semantic information are key research directions of current mapping technology and are therefore of great significance. When a robot has no prior information about its environment, estimating its own pose from acquired sensor data while simultaneously constructing a globally consistent environment map is known as simultaneous localization and mapping (SLAM); a SLAM system based on visual sensors is called visual SLAM. In practical applications, a visual SLAM system can serve a variety of requirements, including positioning, reconstruction and presentation, navigation and obstacle avoidance, and machine-environment interaction, each of which requires the system to build a different form of map. However, the traditional visual SLAM system, built to satisfy the positioning requirement, establishes only a sparse point cloud map, which can hardly support complex tasks such as autonomous driving and human-machine interaction. As the application fields of robots continue to expand and their application level deepens, high intelligence has become an inevitable trend of robot development, and understanding the semantic features of the environment is a precondition for realizing human-machine interaction, reducing the cognitive gap, and completing complex work. The conventional map construction method, however, lacks environmental semantic information, which severely restricts its theoretical development and practical application. In addition, traditional map reconstruction techniques cope poorly with dynamic scenes: in actual use, dynamic objects may occlude the scene at random, yet traditional methods cannot distinguish dynamic from static objects and construct the map under the conventional static-world assumption. Dynamic objects therefore leave a series of motion-trail ghost images in the map, seriously degrading the human visual experience and interfering with the mobile robot's judgment of its own environment.
For dynamic scene semantic mapping, the current research difficulties lie mainly in the following aspects: (1) dynamic objects in a dynamic environment seriously interfere with camera pose estimation and affect point cloud splicing, the most fundamental link in map construction; (2) in existing deep-learning-based semantic segmentation networks, the segmentation result is often insufficiently accurate where an object is occluded or motion blur arises from object movement, which seriously affects the data association between semantic labels and map points; (3) existing semantic mapping methods often struggle to determine the orientation of objects in a dynamic scene, hindering the further development of semantics-assisted positioning.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: overcoming the defects of the prior art and, for the semantic mapping problem of dynamic scenes, providing a dynamic scene three-dimensional reconstruction method based on semantic information assistance that enhances the semantic perception capability of a visual positioning system in a dynamic environment and improves its positioning accuracy in that environment.
The technical solution of the invention is as follows: a semantic information assistance-based dynamic scene three-dimensional reconstruction method comprises the following implementation steps:
(1) a visual sensor collects images; semantic segmentation and point and line feature extraction are performed on the collected images, and static point and line features are screened out using the image semantic segmentation results;
(2) the camera pose is calculated from the static point and line features obtained in step (1), and a static-object point cloud model of the dynamic scene is established;
(3) combining the static-object point cloud model of the dynamic scene obtained in step (2) and fusing the image semantic segmentation results from step (1), the image semantic labels are mapped onto three-dimensional space points, and the semantic categories of all map points from step (2) are updated within a Bayesian filtering framework to obtain a semantic point cloud map;
(4) point clouds of the same category in the semantic point cloud map from step (3) are clustered, and the pose of each object is determined with a minimum bounding box method, yielding a semantic point cloud model of the dynamic scene and solving the problem of semantic mapping in dynamic scenes.
In step (1), the visual sensor collects images, semantic segmentation and point and line feature extraction are performed on the collected images, and static point and line features are screened out according to the semantic segmentation result, as follows:
Semantic segmentation and feature extraction are performed synchronously on the acquired image. In the semantic segmentation module, the classic SegNet encoder-decoder network performs pixel-level semantic segmentation on the RGB image to segment out the dynamic objects in the scene. In the feature extraction module, the RGB image is first converted to grayscale, and ORB feature points and line features are extracted from the grayscale image simultaneously, the line features being extracted with an LSD line feature extractor.
On this basis, combined with the semantic segmentation result of the current image obtained from the semantic segmentation thread, the dynamic ORB feature points and LSD line features located on dynamic objects in the current image are detected. The rule for assigning semantic labels to line features is: a line feature is considered static only if neither of its two endpoints lies on a dynamic object. Feature points and feature lines carrying dynamic semantic labels are removed, and only the static point and line features are kept for calculating the camera pose corresponding to the current image, as sketched below.
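A minimal Python sketch of this screening step, assuming OpenCV (note that `createLineSegmentDetector` is absent from some OpenCV 4.x builds) and a boolean `dynamic_mask` derived from the SegNet result; the function and variable names are illustrative, not from the patent:

```python
import cv2
import numpy as np

def extract_static_features(bgr_image, dynamic_mask):
    """dynamic_mask: H x W bool array, True where a pixel belongs to a dynamic class."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)

    # ORB feature points: discard any point that lands on a dynamic object.
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints = orb.detect(gray, None)
    static_kps = [kp for kp in keypoints
                  if not dynamic_mask[int(kp.pt[1]), int(kp.pt[0])]]
    static_kps, descriptors = orb.compute(gray, static_kps)

    # LSD line features: a line is static only if BOTH endpoints are static.
    lsd = cv2.createLineSegmentDetector()
    lines = lsd.detect(gray)[0]
    static_lines = []
    if lines is not None:
        for x1, y1, x2, y2 in lines.reshape(-1, 4):
            if (not dynamic_mask[int(y1), int(x1)]
                    and not dynamic_mask[int(y2), int(x2)]):
                static_lines.append((x1, y1, x2, y2))
    return static_kps, descriptors, static_lines
```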
In step (2), the camera pose is calculated from the static point and line features obtained in step (1), and the static-object point cloud model in the dynamic scene is established as follows:
From the screened static point and line features, the accurate pose of the camera in the dynamic scene is obtained through the point and line tracking thread. To reduce the computational load of the three-dimensional reconstruction process, suitable keyframes are selected by counting the point and line features the tracking thread can track in each image: when a frame tracks more than 40 feature points and more than 10 line features, it is set as a keyframe.
The keyframe image is then mapped into three-dimensional space through its corresponding depth image to obtain the keyframe's three-dimensional point cloud (see the back-projection sketch below). The point clouds generated from two consecutive keyframes are then spliced using the computed transformation between the two frames, finally establishing the static-object point cloud model of the dynamic scene.
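A minimal numpy sketch of the depth-to-point-cloud mapping, assuming a pinhole camera with intrinsics fx, fy, cx, cy and a metric depth image (the function and parameter names are illustrative):

```python
import numpy as np

def backproject_keyframe(depth, fx, fy, cx, cy):
    """Map an H x W depth image to an N x 3 point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                      # skip pixels without a depth reading
    z = depth[valid]
    x = (u[valid] - cx) * z / fx           # standard pinhole inverse projection
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```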
In step (3), combining the static-object point cloud model of the dynamic scene obtained in step (2) and fusing the semantic segmentation result from step (1), the image semantic labels are mapped onto three-dimensional space points, and the semantic categories of all map points from step (2) are updated within a Bayesian filtering framework to obtain the semantic point cloud map, as follows:
Combining the segmentation results of the semantic segmentation network obtained in step (1) with the point cloud map obtained in step (2), the semantic labels are mapped onto the three-dimensional point cloud map through the camera's intrinsic and extrinsic parameters to obtain an initial semantic point cloud map.
Considering that semantic segmentation results may be inconsistent between consecutive frame images, the invention fuses semantic information across frames by Bayesian filtering: the semantic segmentation data of each incoming frame are incrementally fused into the semantic categories of the map points within a Bayesian filtering framework.
Suppose the current frame is the k-th frame with image I_k, and that the result of segmenting I_k with the SegNet semantic segmentation network is S_k. Let x_k be the pixel in image I_k corresponding to a three-dimensional map point X, let L_k be the semantic category of the three-dimensional map point X, and let l_k be the semantic category of x_k in the segmentation result S_k. P(L_k|S_k) denotes the conditional probability of the semantic-category probability distribution P(L_k) of the three-dimensional map point X given the current segmentation result S_k, and P(l_k|S_k) denotes the conditional probability of the semantic-category probability distribution P(l_k) of the pixel x_k corresponding to the current three-dimensional map point X given the current segmentation result S_k:

P(L_k|S_k) = P(l_k|S_k)

wherein the domain of l_k is the set of categories the semantic segmentation network can distinguish, including person, book, display and keyboard. Given all the semantic segmentation data S_1, S_2, ..., S_k up to the current time, these data are fused to obtain the posterior probability distribution of the semantic category of the map point X, which yields the Bayesian-estimation-based update rule for three-dimensional map point semantic labels adopted by the invention:

P(L_k|S_1, S_2, ..., S_k) = η · P(l_k|S_k) · P(L_{k-1}|S_1, S_2, ..., S_{k-1})

wherein P(L_k|S_1, S_2, ..., S_k) on the left side of the equation denotes the probability distribution of the semantic category of the map point X after comprehensively utilizing the segmentation results S_1, S_2, ..., S_k of all image data up to time k, P(l_k|S_k) denotes the observed value of that distribution obtained from the semantic segmentation result of the k-th frame image, P(L_{k-1}|S_1, S_2, ..., S_{k-1}) is the distribution carried over from time k-1 (the prior at time k), and η is a normalizing constant. The probability distribution of the semantic category of each map point X is continuously updated through this recursion.
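A minimal sketch of this recursion in Python (the class list and function names are illustrative; the extra background bin is an added assumption): each map point keeps a probability vector that is multiplied elementwise by the per-frame observation P(l_k|S_k) and renormalized, the renormalization playing the role of the constant η.

```python
import numpy as np

CLASSES = ["person", "book", "display", "keyboard", "background"]

def bayes_update(prior, observation):
    """Fuse one frame's segmentation evidence into a map point's label belief.

    prior, observation: length-len(CLASSES) probability vectors, i.e. the
    running P(L_{k-1}|S_1,...,S_{k-1}) and the current P(l_k|S_k)."""
    posterior = prior * observation        # elementwise Bayes numerator
    return posterior / posterior.sum()     # renormalize (the eta term)

# Usage: two consecutive frames both suggesting "person" for one map point.
belief = np.full(len(CLASSES), 1.0 / len(CLASSES))   # uniform initial prior
belief = bayes_update(belief, np.array([0.70, 0.10, 0.10, 0.05, 0.05]))
belief = bayes_update(belief, np.array([0.60, 0.10, 0.10, 0.10, 0.10]))
```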
In step (4), point clouds of the same category in the semantic point cloud map from step (3) are clustered, and the pose of each object is determined with a minimum bounding box method to obtain the semantic point cloud model of the dynamic scene, solving the semantic mapping problem for dynamic scenes, as follows:
and (4) clustering adjacent point clouds with the same semantic label in the three-dimensional space into a class through K-means clustering on the three-dimensional point cloud map with the semantic information obtained in the step (3), and performing regular body fitting on each class of point clouds obtained by clustering by adopting an OBB bounding box method (Orientedb bounding box). And calculating an included angle between a three-dimensional map line parallel to the horizontal plane of the world coordinate system in the bounding box and a main shaft of the bounding box. And taking the minimum value of the included angle as the correction quantity of the main shaft of the bounding box, and correcting the direction of the bounding box to obtain the accurate orientation of the bounding box in a world coordinate system. And under a world coordinate system, calculating the coordinate polar value difference of the category space point cloud along the main axis direction of the bounding box to obtain the size of the bounding box. And the finally constructed semantic map simultaneously comprises a bounding box with a semantic label and a spatial point cloud with the semantic label.
Compared with the prior art, the invention has the advantages that:
(1) Aimed at the specific application of semantic mapping in dynamic scenes, the invention endows the map with semantic information through semantic-information assistance, improves the precision of the constructed three-dimensional semantic map by adding line features to pose estimation, enhances the robot's understanding of its environment, and opens new space for the future theoretical development and practical application of robots.
(2) Compared with conventional semantic-information-assisted mapping methods, the invention achieves higher precision and system stability in low-texture and dynamic scenes, and is therefore applicable to a wider range of environments.
In short, the method adopted by the invention is simple in principle and achieves real-time semantic reconstruction of dynamic, low-texture scenes.
Drawings
FIG. 1 is a flow chart of a dynamic scene three-dimensional reconstruction method based on semantic information assistance in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them; all other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
As shown in fig. 1, the specific implementation steps of the present invention are as follows:
Step 1: the visual sensor collects images, semantic segmentation and point and line feature extraction are performed on the collected images, and static point and line features are screened out using the image semantic segmentation results. Semantic segmentation and feature extraction are performed synchronously on the acquired images. In the semantic segmentation module, the classic SegNet encoder-decoder network performs pixel-level semantic segmentation on the RGB image and segments out the dynamic objects in the scene. In the feature extraction module, the RGB image is converted to grayscale, and ORB feature points and line features are extracted from the grayscale image simultaneously, the line features being extracted with an LSD line feature extractor.
On this basis, combined with the semantic segmentation result of the current image obtained from the semantic segmentation thread, the dynamic ORB feature points and LSD line features located on dynamic objects in the current image are detected. The rule for assigning semantic labels to line features is: a line feature is considered static only if neither of its two endpoints lies on a dynamic object. Feature points and feature lines carrying dynamic semantic labels are removed, and only the static point and line features are kept for calculating the camera pose corresponding to the current image.
Step 2: the camera pose is calculated from the static point and line features obtained in step 1, and the static-object point cloud model of the dynamic scene is established. From the screened static point and line features, the accurate pose of the camera in the dynamic scene is obtained through the point and line tracking thread. To reduce the computational load of the three-dimensional reconstruction process, suitable keyframes are selected by counting the point and line features the tracking thread can track in each image: when a frame tracks more than 40 feature points and more than 10 line features, it is set as a keyframe.
The keyframe image is then mapped into three-dimensional space through its corresponding depth image to obtain the keyframe's three-dimensional point cloud. The point clouds generated from two consecutive keyframes are then spliced using the computed transformation between the two frames, finally establishing the static-object point cloud model of the dynamic scene; a splicing sketch follows.
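A minimal sketch of the splicing, assuming each keyframe's camera pose is available as a 4×4 homogeneous world-from-camera matrix T_wc from the tracking thread (variable names are illustrative): every keyframe cloud is transformed into the world frame and the results are concatenated.

```python
import numpy as np

def splice_clouds(clouds, poses):
    """clouds: list of N_i x 3 arrays in camera coords; poses: 4 x 4 T_wc each."""
    world_parts = []
    for pts, T_wc in zip(clouds, poses):
        homog = np.hstack([pts, np.ones((pts.shape[0], 1))])  # N x 4
        world_parts.append((T_wc @ homog.T).T[:, :3])          # into world frame
    return np.vstack(world_parts)                              # spliced map
```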
Step 3: combining the static-object point cloud model established in step 2 and fusing the semantic segmentation results from step 1, the image semantic labels are mapped onto three-dimensional space points, and the semantic categories of all map points from step 2 are updated within a Bayesian filtering framework to obtain the semantic point cloud map. Combining the segmentation results of the semantic segmentation network from step 1 with the point cloud map from step 2, the semantic labels are mapped onto the three-dimensional point cloud map through the camera's intrinsic and extrinsic parameters to obtain an initial semantic point cloud map.
Considering that semantic segmentation results may be inconsistent between consecutive frame images, the invention fuses semantic information across frames by Bayesian filtering: the semantic segmentation data of each incoming frame are incrementally fused into the semantic categories of the map points within a Bayesian filtering framework.
Suppose the current frame is the k-th frame with image I_k, and that the result of segmenting I_k with the SegNet semantic segmentation network is S_k. Let x_k be the pixel in image I_k corresponding to a three-dimensional map point X, let L_k be the semantic category of the three-dimensional map point X, and let l_k be the semantic category of x_k in the segmentation result S_k. P(L_k|S_k) denotes the conditional probability of the semantic-category probability distribution P(L_k) of the three-dimensional map point X given the current segmentation result S_k, and P(l_k|S_k) denotes the conditional probability of the semantic-category probability distribution P(l_k) of the pixel x_k corresponding to the current three-dimensional map point X given the current segmentation result S_k:

P(L_k|S_k) = P(l_k|S_k)

wherein the domain of l_k is the set of categories the semantic segmentation network can distinguish, including person, book, display and keyboard. Given all the semantic segmentation data S_1, S_2, ..., S_k up to the current time, these data are fused to obtain the posterior probability distribution of the semantic category of the map point X, which yields the Bayesian-estimation-based update rule for three-dimensional map point semantic labels adopted by the invention:

P(L_k|S_1, S_2, ..., S_k) = η · P(l_k|S_k) · P(L_{k-1}|S_1, S_2, ..., S_{k-1})

wherein P(L_k|S_1, S_2, ..., S_k) on the left side of the equation denotes the probability distribution of the semantic category of the map point X after comprehensively utilizing the segmentation results S_1, S_2, ..., S_k of all image data up to time k, P(l_k|S_k) denotes the observed value of that distribution obtained from the semantic segmentation result of the k-th frame image, P(L_{k-1}|S_1, S_2, ..., S_{k-1}) is the distribution carried over from time k-1 (the prior at time k), and η is a normalizing constant. The probability distribution of the semantic category of each map point X is continuously updated through this recursion.
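As a short numerical illustration of the recursion (values chosen purely for illustration): with two candidate classes and a uniform prior (0.5, 0.5), an observation P(l_1|S_1) = (0.9, 0.1) gives the posterior (0.9, 0.1); a second, noisier observation (0.6, 0.4) then yields (0.9·0.6, 0.1·0.4) ∝ (0.54, 0.04), which normalizes to about (0.93, 0.07), so a single ambiguous segmentation barely disturbs an already confident label.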
Step 4: point clouds of the same category in the semantic point cloud map from step 3 are clustered, and the pose of each object is determined with a minimum bounding box method, yielding the semantic point cloud model of the dynamic scene and solving the semantic mapping problem for dynamic scenes. On the three-dimensional point cloud map with semantic information obtained in step 3, adjacent point clouds with the same semantic label in three-dimensional space are clustered into one class by K-means clustering, and each class of point cloud obtained by clustering is fitted with a regular body using the OBB (Oriented Bounding Box) method. The angle between each three-dimensional map line parallel to the horizontal plane of the world coordinate system within a bounding box and the bounding box's principal axis is calculated; the minimum of these angles is taken as the correction for the principal axis, and the bounding box direction is corrected accordingly to obtain its accurate orientation in the world coordinate system. In the world coordinate system, the size of the bounding box is obtained by computing the difference of the coordinate extremes of the class's spatial point cloud along each principal axis of the bounding box. The finally constructed semantic map contains both bounding boxes with semantic labels and a spatial point cloud with semantic labels.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand it, the present invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and all inventions that make use of the inventive concepts set forth herein are intended to be protected, provided they do not depart from the spirit and scope of the present invention as defined by the appended claims.

Claims (5)

1. A semantic information assistance-based dynamic scene three-dimensional reconstruction method is characterized by comprising the following steps:
(1) the visual sensor collects images, semantic segmentation and point and line feature extraction are carried out on the collected images, and static point and line features are screened out through image semantic segmentation results;
(2) calculating the pose of a camera by combining the static point and line characteristics obtained in the step (1), and establishing a static object point cloud model in a dynamic scene;
(3) combining the static object point cloud model in the dynamic scene established in the step (2), fusing the semantic segmentation result in the step (1), mapping the image semantic labels to three-dimensional space points, and updating the semantic categories of all map points in the step (2) by adopting a Bayesian filtering framework to obtain a semantic point cloud map;
(4) clustering point clouds of the same category using the semantic point cloud map from step (3), and determining the pose of each object with a minimum bounding box method to obtain a semantic point cloud model of the dynamic scene.
2. The semantic information assistance-based dynamic scene three-dimensional reconstruction method according to claim 1, characterized in that: in the step (1), the method for acquiring the image by the visual sensor, performing semantic segmentation and point and line feature extraction on the acquired image, and screening out static point and line features according to the image semantic segmentation result comprises the following steps:
performing semantic segmentation and feature extraction on the acquired image synchronously; in the semantic segmentation module, performing pixel-level semantic segmentation on the RGB image with the classic SegNet encoder-decoder network to segment out dynamic objects in the scene; in the feature extraction module, converting the RGB image to grayscale and simultaneously extracting ORB feature points and line features from the grayscale image, the line features being extracted with an LSD line feature extractor;
on this basis, combining the semantic segmentation result of the current image obtained from the semantic segmentation thread, detecting the dynamic ORB feature points and LSD line features located on dynamic objects in the current image, the rule for assigning semantic labels to line features being: a line feature is determined to be a static line feature only when neither of its two endpoints lies on a dynamic object; removing the feature points or feature lines with dynamic semantic labels, and keeping only the static point and line features for calculating the camera pose corresponding to the current image.
3. The semantic information assistance-based dynamic scene three-dimensional reconstruction method according to claim 1, characterized in that: in the step (2), the camera pose is calculated by combining the static point and line characteristics obtained in the step (1), and the method for establishing the static object point cloud model in the dynamic scene comprises the following steps:
obtaining the accurate pose of the camera in the dynamic scene through the point and line tracking thread according to the static point and line features obtained by screening; to reduce the computational load of the three-dimensional reconstruction process, selecting suitable keyframes by counting the number of point features and line features tracked by the tracking thread in each image, a frame being set as a keyframe when it tracks more than 40 feature points and more than 10 line features;
and then mapping the key frame image to a three-dimensional space through a corresponding depth image to obtain a three-dimensional point cloud of the key frame, performing point cloud splicing on point cloud maps generated by two continuous frames through a conversion relation between the two frames obtained through calculation, and finally establishing a static object point cloud model in a dynamic scene.
4. The semantic information assistance-based dynamic scene three-dimensional reconstruction method according to claim 1, characterized in that: in the step (3), combining the static-object point cloud model in the dynamic scene obtained in the step (2), fusing the semantic segmentation result from the step (1), mapping the image semantic labels onto three-dimensional space points, and updating the semantic categories of all map points from the step (2) with a Bayesian filtering framework to obtain the semantic point cloud map, as follows:
combining the segmentation results of the semantic segmentation network obtained in the step (1) with the point cloud map obtained in the step (2), mapping the semantic labels onto the three-dimensional point cloud map through the camera's intrinsic and extrinsic parameters to obtain an initial semantic point cloud map;
in consideration of the situation that semantic segmentation results between continuous frame images are inconsistent, semantic information between different frames is fused in a Bayesian filtering mode, and the acquired semantic segmentation data of each frame image is subjected to incremental data fusion on semantic categories of map points by using a Bayesian filtering frame;
let the current frame be the k-th frame with image I_k, the result of segmenting I_k with the SegNet semantic segmentation network being S_k; let x_k be the pixel in image I_k corresponding to a three-dimensional map point X, L_k the semantic category of the three-dimensional map point X, and l_k the semantic category of x_k in the segmentation result S_k; P(L_k|S_k) denotes the conditional probability of the semantic-category probability distribution P(L_k) of the three-dimensional map point X given the current segmentation result S_k, and P(l_k|S_k) denotes the conditional probability of the semantic-category probability distribution P(l_k) of the pixel x_k corresponding to the current three-dimensional map point X given the current segmentation result S_k:

P(L_k|S_k) = P(l_k|S_k)

wherein the domain of l_k is the set of categories the semantic segmentation network can distinguish, including person, book, display and keyboard; given all the semantic segmentation data S_1, S_2, ..., S_k up to the current time, the data are fused to obtain the posterior probability distribution of the semantic category of the map point X, yielding the adopted Bayesian-estimation-based update rule for three-dimensional map point semantic labels:

P(L_k|S_1, S_2, ..., S_k) = η · P(l_k|S_k) · P(L_{k-1}|S_1, S_2, ..., S_{k-1})

wherein P(L_k|S_1, S_2, ..., S_k) on the left side of the equation denotes the probability distribution of the semantic category of the map point X after comprehensively utilizing the segmentation results S_1, S_2, ..., S_k of all image data up to time k, P(l_k|S_k) denotes the observed value of that distribution obtained from the semantic segmentation result of the k-th frame image, P(L_{k-1}|S_1, S_2, ..., S_{k-1}) is the distribution carried over from time k-1, and η is a normalizing constant; the probability distribution of the semantic category of the map point X is continuously updated through this recursion.
5. The semantic information assistance-based dynamic scene three-dimensional reconstruction method according to claim 1, characterized in that: in the step (4), the point clouds of the same category are clustered by combining the semantic point cloud map in the step (3), and the pose of the object is determined by adopting a minimum bounding box method, so that a semantic point cloud model of the dynamic scene is obtained by the following method:
clustering adjacent point clouds with the same semantic label in three-dimensional space into one class by K-means clustering, and fitting each class of point cloud obtained by clustering with a regular body using the OBB (Oriented Bounding Box) method; calculating the angle between each three-dimensional map line parallel to the horizontal plane of the world coordinate system within a bounding box and the bounding box's principal axis, taking the minimum of these angles as the correction for the principal axis, and correcting the bounding box direction to obtain its accurate orientation in the world coordinate system; in the world coordinate system, obtaining the size of the bounding box by computing the difference of the coordinate extremes of the class's spatial point cloud along each principal axis of the bounding box; the finally constructed semantic map containing both bounding boxes with semantic labels and a three-dimensional spatial point cloud with semantic labels.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111648384.4A CN114332394A (en) 2021-12-29 2021-12-29 Semantic information assistance-based dynamic scene three-dimensional reconstruction method

Publications (1)

Publication Number Publication Date
CN114332394A (en) 2022-04-12

Family

ID=81018656




Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021092702A1 (en) * 2019-11-13 2021-05-20 Youval Nehmadi Autonomous vehicle environmental perception software architecture
CN111325843A (en) * 2020-03-09 2020-06-23 北京航空航天大学 Real-time semantic map construction method based on semantic inverse depth filtering
CN111798475A (en) * 2020-05-29 2020-10-20 浙江工业大学 Indoor environment 3D semantic map construction method based on point cloud deep learning
CN113570629A (en) * 2021-09-28 2021-10-29 山东大学 Semantic segmentation method and system for removing dynamic objects

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LINYAN CUI et al.: "SOF-SLAM: A Semantic Visual SLAM for Dynamic Environments", IEEE ACCESS, 27 November 2019 (2019-11-27) *
FENG KAI et al.: "Development and Research Analysis of Vision-based SLAM Technology", Information Technology, no. 10, 25 October 2017 (2017-10-25)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393386A (en) * 2022-10-25 2022-11-25 杭州华橙软件技术有限公司 Three-dimensional scene graph generation method, device and equipment and readable storage medium
CN116452742A (en) * 2023-04-21 2023-07-18 中国航天员科研训练中心 Space layout analysis method and system for space operation scene
CN116452742B (en) * 2023-04-21 2023-11-03 中国航天员科研训练中心 Space layout analysis method and system for space operation scene
CN117128985A (en) * 2023-04-27 2023-11-28 荣耀终端有限公司 Point cloud map updating method and equipment
CN117128985B (en) * 2023-04-27 2024-05-31 荣耀终端有限公司 Point cloud map updating method and equipment
CN117274353A (en) * 2023-11-20 2023-12-22 光轮智能(北京)科技有限公司 Synthetic image data generating method, control device and readable storage medium
CN117274353B (en) * 2023-11-20 2024-02-20 光轮智能(北京)科技有限公司 Synthetic image data generating method, control device and readable storage medium

Similar Documents

Publication Publication Date Title
CN111325843B (en) Real-time semantic map construction method based on semantic inverse depth filtering
CN114332394A (en) Semantic information assistance-based dynamic scene three-dimensional reconstruction method
CN111724439B (en) Visual positioning method and device under dynamic scene
CN109974743B (en) Visual odometer based on GMS feature matching and sliding window pose graph optimization
CN110533720B (en) Semantic SLAM system and method based on joint constraint
CN106780560B (en) Bionic robot fish visual tracking method based on feature fusion particle filtering
CN112446882A (en) Robust visual SLAM method based on deep learning in dynamic scene
CN105825520A (en) Monocular SLAM (Simultaneous Localization and Mapping) method capable of creating large-scale map
CN112556719A (en) Visual inertial odometer implementation method based on CNN-EKF
CN113744315B (en) Semi-direct vision odometer based on binocular vision
CN115619826A (en) Dynamic SLAM method based on reprojection error and depth estimation
CN113066129A (en) Visual positioning and mapping system based on target detection in dynamic environment
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
CN115661341A (en) Real-time dynamic semantic mapping method and system based on multi-sensor fusion
CN116879870A (en) Dynamic obstacle removing method suitable for low-wire-harness 3D laser radar
CN116188417A (en) Slit detection and three-dimensional positioning method based on SLAM and image processing
Islam et al. MVS‐SLAM: Enhanced multiview geometry for improved semantic RGBD SLAM in dynamic environment
CN114549549A (en) Dynamic target modeling tracking method based on instance segmentation in dynamic environment
CN112945233B (en) Global drift-free autonomous robot simultaneous positioning and map construction method
Min et al. Coeb-slam: A robust vslam in dynamic environments combined object detection, epipolar geometry constraint, and blur filtering
Zhao et al. A review of visual SLAM for dynamic objects
Liu et al. A visual SLAM method assisted by IMU and deep learning in indoor dynamic blurred scenes
CN115729250A (en) Flight control method, device and equipment of unmanned aerial vehicle and storage medium
Wang et al. A novel end-to-end visual odometry framework based on deep neural network
CN117152199B (en) Dynamic target motion vector estimation method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination