CN112562081A - Visual map construction method for visual layered positioning - Google Patents

Visual map construction method for visual layered positioning

Info

Publication number
CN112562081A
CN112562081A (application CN202110175262.1A)
Authority
CN
China
Prior art keywords
frame
visual
superpoint
points
key
Prior art date
Legal status
Granted
Application number
CN202110175262.1A
Other languages
Chinese (zh)
Other versions
CN112562081B (en)
Inventor
朱世强
钟心亮
顾建军
姜峰
李特
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110175262.1A
Publication of CN112562081A
Application granted
Publication of CN112562081B
Legal status: Active (granted)

Classifications

    • G06T 17/05 — Geographic models (three-dimensional [3D] modelling; data description of 3D objects)
    • G06N 3/045 — Combinations of neural networks
    • G06N 3/08 — Neural network learning methods
    • G06T 7/73 — Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2200/04 — Indexing scheme involving 3D image data
    • G06T 2207/20081 — Training; Learning
    • G06T 2207/20084 — Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Remote Sensing (AREA)
  • Computer Graphics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual map construction method for visual layered positioning, comprising the following steps: 1) sequentially acquiring binocular image frame data collected by a binocular camera and determining the motion trajectory; 2) extracting a NetVLAD global descriptor, SuperPoint feature points and local descriptors from each image frame; 3) finding the matching key points of the feature points of each image frame and incrementally recovering the 3D position of each feature point; 4) determining the optimal co-visibility key frames of each frame according to the 2D observations of its 3D feature points; 5) generating the visual map information finally used for visual layered positioning. By combining deep learning features, the method enhances the descriptive power of the visual map, and the generated map for visual positioning contains multi-level description information and can be used for globally robust visual positioning in a consistent coordinate system.

Description

Visual map construction method for visual layered positioning
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual map construction method for visual layered positioning.
Background
In recent years, with the continuous development of computer vision, SLAM technology has been widely applied in fields such as virtual reality, augmented reality, robotics, and unmanned aerial vehicles. With the continuous development of computer hardware, real-time processing of visual information has become possible; real-time localization and map construction using visual information greatly increases the amount of acquired information while reducing the cost of intelligent robot products.
However, most visual SLAM systems focus on online pose estimation, lack a localization capability in a global coordinate system, and their outputs usually do not include an explicit visual-map representation. Traditional methods rely on bag-of-words models and matching of hand-crafted feature points, which have limited adaptability to the environment and a low localization success rate.
Disclosure of Invention
In view of the above, the invention provides a visual map construction method for visual layered positioning, which combines deep learning with a visual map construction method and representation, can construct a globally consistent map, and supports vision-based layered, globally consistent positioning.
The invention adopts the following technical scheme: a visual map construction method for visual layered positioning comprises the following steps:
(1) Sequentially acquire the binocular image frames collected by the binocular camera and record the motion trajectory of the camera, obtaining a binocular image sequence containing the motion trajectory F = {(I_i^l, I_i^r, T_ex, T_i)}, i = 1, ..., N, where I_i^l denotes the left image of the i-th frame, I_i^r denotes the right image of the i-th frame, T_ex denotes the extrinsic parameters of the binocular camera, T_i denotes the pose of the left camera of the i-th frame relative to the world coordinate system, and N is the number of binocular image frames. Take the first frame as a key frame, and select subsequent key frames from the binocular image sequence F to form a key frame sequence KF = {KF_j}, j = 1, ..., M, where the number of key frames M does not exceed the number of binocular image frames N.
(2) For each key frame, respectively extract a NetVLAD global descriptor V_j, SuperPoint feature points P_j and the corresponding local descriptors D_j. Sort the SuperPoint feature points P_j by response value from high to low and keep the first 2000 feature points and their corresponding local descriptors, so that for the j-th key frame the triple description information (V_j, P_j, D_j) is obtained. The NetVLAD global descriptor V_j is a feature vector of fixed dimension, and each local descriptor in D_j is a vector of fixed dimension.
(3) Find the matching points of the SuperPoint feature points P_j of each key frame and incrementally recover the 3D position X of each SuperPoint feature point.
(4) Traverse each 3D position X obtained in step (3) and count all key frames that observe X; for each key frame, sort the other key frames in descending order of the number of co-observed 3D points and take the first 5 as its optimal co-visibility key frames. Finally, take the key frame sequence and the 3D position X corresponding to each SuperPoint feature point as the visual map information.
Further, the selection of subsequent key frames must satisfy the following condition: the Euclidean distance between adjacent left (or right) frame images is greater than 0.3 m and their rotation angle is greater than 3 degrees.
Further, the step (3) comprises the following sub-steps:
(3.1) For the SuperPoint feature point p in a certain triple description information (V_j, P_j, D_j), select the candidate frame set C of key frames that can simultaneously observe p, where each candidate frame A is selected from the key frames and satisfies: the pose T_A of the key frame in the candidate frame set C and the pose T_j of the current key frame differ by a translation of less than 10 m and a rotation angle of less than 45 degrees;
(3.2) Traverse each frame in the candidate frame set C. For the local descriptors D_A corresponding to the feature points of candidate frame A, compute the distance between the local descriptor d of p and each descriptor in D_A, and obtain the two nearest local descriptors d_1 and d_2. If they satisfy the nearest-neighbour ratio test (the nearest distance is sufficiently smaller than the second-nearest distance), the feature point corresponding to d_1 and the SuperPoint feature point p are mutually matching points. After the traversal is finished, the set of all matching points of p in the candidate frame set C is obtained. The matching information associated with the 3D position X of p consists of the matched 2D observations and the poses of the key frames that observe them. According to the intrinsic parameters K of the binocular camera, a constraint equation is established to solve X; for each matched observation (u_k, v_k) in key frame k with pose (R_k, t_k), the constraint equation is s_k [u_k, v_k, 1]^T = K (R_k X + t_k). If the solved X has a negative depth Z or a depth Z greater than 40 m, X is discarded.
(3.3) Traverse each SuperPoint feature point p and obtain the 3D position corresponding to each SuperPoint feature point.
Compared with the prior art, the invention has the following beneficial effects. The method combines the NetVLAD global image descriptor used for image retrieval with the convolutional-neural-network-based SuperPoint local feature points and descriptors; their combination enhances the descriptive power of the visual map and decouples visual localization into global and local localization. The generated map for visual localization contains multi-level description information: the global information includes the size information of the map, the number of feature points and the corresponding 3D point positions, the number of key frames and the NetVLAD descriptor of each key frame; the local information includes the pose of each key frame, the positions of the feature points in each key frame and the corresponding SuperPoint descriptors, as well as the optimal co-visibility key frame indices of each key frame. Combined, the two can be used for globally robust visual positioning in a consistent coordinate system. On the one hand, this solves the problem that traditional SLAM systems cannot reuse maps; on the other hand, it improves the robustness of global positioning.
Drawings
FIG. 1 is a flow chart of a visual map construction method for visual layered positioning according to the present invention;
FIG. 2 is a schematic diagram of multi-view observation recovery 3D feature points according to the present invention;
FIG. 3 is a schematic diagram of a visual map containing information according to the present invention;
FIG. 4 is a schematic view of the layered global positioning based on the visual map according to the present invention.
Detailed Description
The principles and aspects of the present invention will be further explained with reference to the drawings and the detailed description, it being understood that the illustrated embodiments are only some examples and the specific embodiments described herein are merely illustrative of the relevant invention and are not intended to limit the invention.
Fig. 1 is a schematic flow chart of a visual map construction method for visual layered positioning according to the present invention, where the visual map construction method includes the following steps:
(1) Install the binocular camera on the robot body, move the robot and start collecting data. Sequentially acquire the binocular image frames collected by the binocular camera and record the motion trajectory of the camera, obtaining a binocular image sequence containing the motion trajectory F = {(I_i^l, I_i^r, T_ex, T_i)}, i = 1, ..., N, where I_i^l denotes the left image of the i-th frame, I_i^r denotes the right image of the i-th frame, T_ex denotes the extrinsic parameters of the binocular camera, T_i denotes the pose of the left camera of the i-th frame relative to the world coordinate system, and N is the number of binocular image frames. Take the first frame as a key frame, and select subsequent key frames from the binocular image sequence F to form a key frame sequence KF = {KF_j}, j = 1, ..., M, where the number of key frames M does not exceed the number of binocular image frames N. The selection of subsequent key frames must satisfy the following condition: the Euclidean distance between adjacent left (or right) frame images is greater than 0.3 m and their rotation angle is greater than 3 degrees.
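For illustration, the key-frame selection rule above can be sketched in Python as follows. This is a minimal sketch, not part of the patent: comparing each frame against the last selected key frame is an assumption, and SciPy is used only to compute the rotation angle.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def select_keyframes(poses, t_thresh=0.3, r_thresh_deg=3.0):
    """poses: list of 4x4 left-camera-to-world matrices T_i, one per binocular frame.
    Returns indices of selected key frames; the first frame is always a key frame."""
    keyframes = [0]
    for i in range(1, len(poses)):
        T_ref, T_cur = poses[keyframes[-1]], poses[i]
        # Euclidean distance between camera centres
        dist = np.linalg.norm(T_cur[:3, 3] - T_ref[:3, 3])
        # Relative rotation angle in degrees
        dR = T_ref[:3, :3].T @ T_cur[:3, :3]
        angle = np.degrees(np.linalg.norm(R.from_matrix(dR).as_rotvec()))
        # Thresholds (0.3 m, 3 degrees) follow the condition stated in the text
        if dist > t_thresh and angle > r_thresh_deg:
            keyframes.append(i)
    return keyframes
```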
(2) For each key frame, extract a NetVLAD global descriptor V_j based on a convolutional neural network; the NetVLAD global descriptor V_j is a feature vector of fixed dimension. Extract the SuperPoint feature points P_j of the key frame and the corresponding local descriptors D_j based on a convolutional neural network; each local descriptor in D_j is a vector of fixed dimension. Sort the SuperPoint feature points P_j by response value from high to low and keep the first 2000 feature points and their corresponding local descriptors, so that for the j-th key frame the triple description information (V_j, P_j, D_j) is obtained.
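A minimal sketch of this per-key-frame description step, assuming `netvlad_model` and `superpoint_model` are pre-trained network wrappers (hypothetical interfaces, since the patent does not fix an implementation) that return a global vector and (keypoints, responses, descriptors) respectively:

```python
import numpy as np

MAX_POINTS = 2000  # the 2000 highest-response SuperPoint features are kept

def describe_keyframe(image, netvlad_model, superpoint_model):
    """Build the triple description (global descriptor, keypoints, local descriptors)."""
    v_global = netvlad_model(image)                    # fixed-dimensional global descriptor
    kpts, responses, descs = superpoint_model(image)   # Nx2 points, N responses, NxD descriptors
    order = np.argsort(-responses)[:MAX_POINTS]        # sort by response, high to low, keep the top 2000
    return v_global, kpts[order], descs[order]
```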
(3) To find the matching points of the SuperPoint feature points P_j of the key frames and incrementally recover the 3D position X of each SuperPoint feature point, it is necessary to find, within the key frame sequence KF, the 2D point locations that observe the same 3D position X, as shown in fig. 2. This comprises the following sub-steps:
(3.1) For the SuperPoint feature point p in a certain triple description information (V_j, P_j, D_j), select the candidate frame set C of key frames that can simultaneously observe p, where each candidate frame A is selected from the key frames and satisfies: the pose T_A of the key frame in the candidate frame set C and the pose T_j of the current key frame differ by a translation of less than 10 m and a rotation angle of less than 45 degrees;
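A sketch of this candidate-frame selection, using the 10 m / 45° thresholds stated above (function and variable names are illustrative, not from the patent):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def candidate_frames(kf_poses, j, max_trans=10.0, max_rot_deg=45.0):
    """Return indices of key frames whose pose is close enough to key frame j
    to plausibly observe the same SuperPoint feature points."""
    T_j = kf_poses[j]
    cands = []
    for a, T_a in enumerate(kf_poses):
        if a == j:
            continue
        trans = np.linalg.norm(T_a[:3, 3] - T_j[:3, 3])
        dR = T_j[:3, :3].T @ T_a[:3, :3]
        rot = np.degrees(np.linalg.norm(R.from_matrix(dR).as_rotvec()))
        if trans < max_trans and rot < max_rot_deg:
            cands.append(a)
    return cands
```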
(3.2) Traverse each frame in the candidate frame set C. For the local descriptors D_A corresponding to the feature points of candidate frame A, compute the distance between the local descriptor d of p and each descriptor in D_A, and obtain the two nearest local descriptors d_1 and d_2. If they satisfy the nearest-neighbour ratio test (the nearest distance is sufficiently smaller than the second-nearest distance), the feature point corresponding to d_1 and the SuperPoint feature point p are mutually matching points. After the traversal is finished, the set of all matching points of p in the candidate frame set C is obtained, from which the corresponding 3D point set is recovered. The matching information associated with the 3D position X of the SuperPoint feature point p consists of the matched 2D observations and the poses of the key frames that observe them. According to the intrinsic parameters K of the binocular camera, a constraint equation is established to solve X; for each matched observation (u_k, v_k) in key frame k with pose (R_k, t_k), the constraint equation is s_k [u_k, v_k, 1]^T = K (R_k X + t_k). If the solved X has a negative depth Z or a depth Z greater than 40 m, X is discarded.
(3.3) Traverse each SuperPoint feature point p and obtain the 3D position corresponding to each SuperPoint feature point;
(3.4) Traverse each feature point of every key frame in the whole key frame sequence KF to obtain the corresponding 3D value of each feature point, finally obtaining the basic information of the whole map.
(4) The 3D position X obtained for each SuperPoint feature point will be observed by multiple frames. Traverse each X, count all key frames that observe the 3D position X, and for each key frame sort the other key frames in descending order of the number of co-observed 3D points, taking the first 5 as its optimal co-visibility key frames. Finally, the key frame sequence and the 3D position X corresponding to each SuperPoint feature point are taken as the visual map information. As shown in fig. 3, the final map information (taking the left camera of the binocular camera as an example) can be represented as map basic information and feature information: the basic information includes the map size information, the number of key frames, the number of feature points and the number of 3D points; the feature information includes the key frames, the key frame global descriptors, the co-visibility key frames, the feature points, the feature point descriptors and the set of 3D points corresponding to the feature points.
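A compact sketch of the resulting map structure described above, using Python dataclasses; the field names and layout are illustrative, since the patent specifies the content of the map rather than a storage format:

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class KeyFrame:
    pose: np.ndarray           # 4x4 pose of the left camera in the world frame
    global_desc: np.ndarray    # NetVLAD global descriptor
    keypoints: np.ndarray      # Nx2 SuperPoint feature point locations
    descriptors: np.ndarray    # NxD SuperPoint local descriptors
    point3d_ids: List[int] = field(default_factory=list)  # 3D point id per keypoint, -1 if untriangulated
    covis_ids: List[int] = field(default_factory=list)    # indices of the 5 best co-visibility key frames

@dataclass
class VisualMap:
    keyframes: List[KeyFrame]  # key frames with poses, descriptors and co-visibility links
    points3d: np.ndarray       # Mx3 array of triangulated SuperPoint feature positions
```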
The visual map information obtained by the visual map construction method is stored and loaded on the robot; the robot is then restarted and placed in the environment for positioning, as shown in fig. 4. First, the features of the collected image are extracted: a NetVLAD global descriptor, SuperPoint feature points and descriptors. The NetVLAD global descriptor of the current frame is compared with the set of global descriptors in the map information by computing Euclidean distances, yielding the closest key frame in the map; this completes the first layer of the hierarchical positioning, i.e. the global coarse positioning, and indicates that the current frame is located near the closest key frame. Then the 5 optimal co-visibility key frames corresponding to that key frame are found, the current frame is feature-matched against the closest key frame and its 5 optimal co-visibility key frames by the method of step (3.2), 3D-2D matches are obtained according to the data association of step (3.3), and finally the 6-DoF pose is obtained by a PnP solution, completing the fine positioning function of the layered positioning.
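A sketch of the two-layer localization procedure just described: coarse retrieval of the nearest key frame by Euclidean distance between NetVLAD descriptors, then fine 6-DoF pose estimation by PnP over 2D-3D matches. It reuses the illustrative `match_ratio_test`, `KeyFrame` and `VisualMap` helpers sketched earlier and OpenCV's solvePnPRansac for the PnP step; it is a sketch under those assumptions, not the patent's implementation.

```python
import numpy as np
import cv2

def localize(query_global_desc, query_kpts, query_descs, vmap, K):
    """Hierarchical localization of one query frame against a VisualMap.
    Returns (rvec, tvec) of the query camera, or None on failure."""
    # Layer 1: global coarse localization by NetVLAD retrieval
    dists = [np.linalg.norm(kf.global_desc - query_global_desc) for kf in vmap.keyframes]
    best = int(np.argmin(dists))
    frames = [best] + list(vmap.keyframes[best].covis_ids)   # nearest key frame + its 5 co-visibility frames

    # Layer 2: local fine localization by 2D-3D matching and PnP
    pts3d, pts2d = [], []
    for f in frames:
        kf = vmap.keyframes[f]
        for q_idx, d in enumerate(query_descs):
            m = match_ratio_test(d, kf.descriptors)           # ratio-test matcher from the earlier sketch
            if m is not None and kf.point3d_ids[m] >= 0:
                pts3d.append(vmap.points3d[kf.point3d_ids[m]])
                pts2d.append(query_kpts[q_idx])
    if len(pts3d) < 4:
        return None
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        np.asarray(pts3d, dtype=np.float64),
        np.asarray(pts2d, dtype=np.float64),
        K, None)
    return (rvec, tvec) if ok else None
```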
Table 1 gives a comparison of the visual mapping method of the present invention with some previously disclosed SLAM methods. The sparse/dense criterion in Table 1 refers to the number of 3D points in the map: if the map is constructed using only a small number of feature points, such as the 2000 feature points selected in this scheme, the map is sparse; if all pixel points of the selected images are reconstructed, the map is considered dense. In Table 1, the coarse positioning mode DBoW is a bag-of-words model, which requires pre-training a feature dictionary for image retrieval and finally outputs a word vector for coarse positioning for each image.
According to the comparison in Table 1, the prior-art schemes basically do not consider map reuse, which means that they cannot perform globally consistent positioning and that each positioning result depends on the initial position at which the data were acquired.
TABLE 1 Comparison of the performance of the positioning method of the present invention with prior-art methods
In terms of robustness, Table 2 compares the present invention with VINS-Mono, the scheme closest to the present invention, on the collected data set. Because the positioning precision of the invention is tied to the camera poses used for map construction, only the coarse positioning result can be compared with VINS-Mono. The total number of frames of 7 sequences is counted, and the number of frames that pass the same place and are captured at least twice in a sequence is counted and recorded as the number of loop frames. The 7 image sequences contain challenging factors such as motion blur, occlusion and scene illumination changes. As can be seen from the results in Table 2, the present invention greatly improves the success rate and robustness of coarse positioning; since coarse positioning precedes fine positioning, this benefits the whole positioning system.
TABLE 2 Comparison of the coarse positioning of the present invention with the VINS-Mono method
Therefore, the visual map construction method of the invention solves, on the one hand, the problem that traditional SLAM systems cannot reuse maps, and on the other hand improves the robustness of global positioning.
The above are merely preferred embodiments of the present invention, and the scope of the invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (3)

1. A visual map construction method for visual layered positioning, characterized in that the method comprises the following steps:
(1) sequentially acquiring the binocular image frames collected by the binocular camera and recording the motion trajectory of the camera to obtain a binocular image sequence containing the motion trajectory F = {(I_i^l, I_i^r, T_ex, T_i)}, i = 1, ..., N, where I_i^l denotes the left image of the i-th frame, I_i^r denotes the right image of the i-th frame, T_ex denotes the extrinsic parameters of the binocular camera, T_i denotes the pose of the left camera of the i-th frame relative to the world coordinate system, and N is the number of binocular image frames; taking the first frame as a key frame and selecting subsequent key frames from the binocular image sequence F to form a key frame sequence KF = {KF_j}, j = 1, ..., M, where the number of key frames M does not exceed the number of binocular image frames N;
(2) for each key frame, respectively extracting a NetVLAD global descriptor V_j, SuperPoint feature points P_j and the corresponding local descriptors D_j; sorting the SuperPoint feature points P_j by response value from high to low and keeping the first 2000 feature points and their corresponding local descriptors, so that for the j-th key frame the triple description information (V_j, P_j, D_j) is obtained; the NetVLAD global descriptor V_j is a feature vector of fixed dimension, and each local descriptor in D_j is a vector of fixed dimension;
(3) finding the matching points of the SuperPoint feature points P_j of each key frame and incrementally recovering the 3D position X of each SuperPoint feature point;
(4) traversing each 3D position X obtained in step (3) and counting all key frames that observe X; for each key frame, sorting the other key frames in descending order of the number of co-observed 3D points and taking the first 5 as its optimal co-visibility key frames; finally, taking the key frame sequence and the 3D position X corresponding to each SuperPoint feature point as the visual map information.
2. The visual map construction method for visual layered positioning according to claim 1, wherein the selection of subsequent key frames must satisfy the following condition: the Euclidean distance between adjacent left (or right) frame images is greater than 0.3 m and their rotation angle is greater than 3 degrees.
3. The visual map construction method for visual layered positioning according to claim 1, wherein step (3) comprises the following sub-steps:
(3.1) for the SuperPoint feature point p in a certain triple description information (V_j, P_j, D_j), selecting the candidate frame set C of key frames that can simultaneously observe p, where each candidate frame A is selected from the key frames and satisfies: the pose T_A of the key frame in the candidate frame set C and the pose T_j of the current key frame differ by a translation of less than 10 m and a rotation angle of less than 45 degrees;
(3.2) traversing each frame in the candidate frame set C; for the local descriptors D_A corresponding to the feature points of candidate frame A, computing the distance between the local descriptor d of p and each descriptor in D_A and obtaining the two nearest local descriptors d_1 and d_2; if they satisfy the nearest-neighbour ratio test (the nearest distance is sufficiently smaller than the second-nearest distance), the feature point corresponding to d_1 and the SuperPoint feature point p are mutually matching points; after the traversal is finished, the set of all matching points of p in the candidate frame set C is obtained; the matching information associated with the 3D position X of the SuperPoint feature point p consists of the matched 2D observations and the poses of the key frames that observe them; according to the intrinsic parameters K of the binocular camera, a constraint equation is established to solve X, where for each matched observation (u_k, v_k) in key frame k with pose (R_k, t_k) the constraint equation is s_k [u_k, v_k, 1]^T = K (R_k X + t_k); if the solved X has a negative depth Z or a depth Z greater than 40 m, X is discarded;
(3.3) traversing each SuperPoint feature point p and obtaining the 3D position corresponding to each SuperPoint feature point.
CN202110175262.1A 2021-02-07 2021-02-07 Visual map construction method for visual layered positioning Active CN112562081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110175262.1A CN112562081B (en) 2021-02-07 2021-02-07 Visual map construction method for visual layered positioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110175262.1A CN112562081B (en) 2021-02-07 2021-02-07 Visual map construction method for visual layered positioning

Publications (2)

Publication Number Publication Date
CN112562081A true CN112562081A (en) 2021-03-26
CN112562081B CN112562081B (en) 2021-05-11

Family

ID=75035905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110175262.1A Active CN112562081B (en) 2021-02-07 2021-02-07 Visual map construction method for visual layered positioning

Country Status (1)

Country Link
CN (1) CN112562081B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781563A (en) * 2021-09-14 2021-12-10 中国民航大学 Mobile robot loop detection method based on deep learning
CN114639006A (en) * 2022-03-15 2022-06-17 北京理工大学 Loop detection method and device and electronic equipment
CN114674328A (en) * 2022-03-31 2022-06-28 北京百度网讯科技有限公司 Map generation method, map generation device, electronic device, storage medium, and vehicle
CN114694013A (en) * 2022-04-11 2022-07-01 北京理工大学 Distributed multi-machine cooperative vision SLAM method and system
CN115049731A (en) * 2022-06-17 2022-09-13 感知信息科技(浙江)有限责任公司 Visual mapping and positioning method based on binocular camera

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070615A (en) * 2019-04-12 2019-07-30 北京理工大学 A kind of panoramic vision SLAM method based on polyphaser collaboration
CN111292420A (en) * 2020-02-28 2020-06-16 北京百度网讯科技有限公司 Method and device for constructing map
CN111652934A (en) * 2020-05-12 2020-09-11 Oppo广东移动通信有限公司 Positioning method, map construction method, device, equipment and storage medium
CN111768498A (en) * 2020-07-09 2020-10-13 中国科学院自动化研究所 Visual positioning method and system based on dense semantic three-dimensional map and mixed features

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070615A (en) * 2019-04-12 2019-07-30 北京理工大学 A kind of panoramic vision SLAM method based on polyphaser collaboration
CN111292420A (en) * 2020-02-28 2020-06-16 北京百度网讯科技有限公司 Method and device for constructing map
CN111652934A (en) * 2020-05-12 2020-09-11 Oppo广东移动通信有限公司 Positioning method, map construction method, device, equipment and storage medium
CN111768498A (en) * 2020-07-09 2020-10-13 中国科学院自动化研究所 Visual positioning method and system based on dense semantic three-dimensional map and mixed features

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAO HAN et al.: "SuperPointVO: A Lightweight Visual Odometry based on CNN Feature Extraction", IEEE *
唐灿 et al.: "A Survey of Image Feature Detection and Matching Methods", Journal of Nanjing University of Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113781563A (en) * 2021-09-14 2021-12-10 中国民航大学 Mobile robot loop detection method based on deep learning
CN113781563B (en) * 2021-09-14 2023-10-24 中国民航大学 Mobile robot loop detection method based on deep learning
CN114639006A (en) * 2022-03-15 2022-06-17 北京理工大学 Loop detection method and device and electronic equipment
CN114639006B (en) * 2022-03-15 2023-09-26 北京理工大学 Loop detection method and device and electronic equipment
CN114674328A (en) * 2022-03-31 2022-06-28 北京百度网讯科技有限公司 Map generation method, map generation device, electronic device, storage medium, and vehicle
CN114694013A (en) * 2022-04-11 2022-07-01 北京理工大学 Distributed multi-machine cooperative vision SLAM method and system
CN114694013B (en) * 2022-04-11 2022-11-15 北京理工大学 Distributed multi-machine cooperative vision SLAM method and system
CN115049731A (en) * 2022-06-17 2022-09-13 感知信息科技(浙江)有限责任公司 Visual mapping and positioning method based on binocular camera

Also Published As

Publication number Publication date
CN112562081B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN112562081B (en) Visual map construction method for visual layered positioning
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
CN103577793B (en) Gesture identification method and device
Li et al. Object detection in the context of mobile augmented reality
CN107016319B (en) Feature point positioning method and device
CN110555408B (en) Single-camera real-time three-dimensional human body posture detection method based on self-adaptive mapping relation
CN112598775B (en) Multi-view generation method based on contrast learning
CN109272577B (en) Kinect-based visual SLAM method
Laga A survey on deep learning architectures for image-based depth reconstruction
CN110119768B (en) Visual information fusion system and method for vehicle positioning
CN110942476A (en) Improved three-dimensional point cloud registration method and system based on two-dimensional image guidance and readable storage medium
CN112419497A (en) Monocular vision-based SLAM method combining feature method and direct method
Dharmasiri et al. MO-SLAM: Multi object slam with run-time object discovery through duplicates
CN115147599A (en) Object six-degree-of-freedom pose estimation method for multi-geometric feature learning of occlusion and truncation scenes
Wang et al. Joint head pose and facial landmark regression from depth images
CN113592015B (en) Method and device for positioning and training feature matching network
CN111402331A (en) Robot repositioning method based on visual word bag and laser matching
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
Shin et al. Loop closure detection in simultaneous localization and mapping using descriptor from generative adversarial network
CN111626417A (en) Closed loop detection method based on unsupervised deep learning
CN110070626B (en) Three-dimensional object retrieval method based on multi-view classification
Lu et al. Model and exemplar-based robust head pose tracking under occlusion and varying expression
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning
Song et al. ConcatNet: A deep architecture of concatenation-assisted network for dense facial landmark alignment
Ma et al. Capsule-based regression tracking via background inpainting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant