CN113034596B - Three-dimensional object detection and tracking method


Info

Publication number
CN113034596B
CN113034596B (application CN202110326833.7A)
Authority
CN
China
Prior art keywords
superpoint
tracking
frame
dimensional object
matching
Prior art date
Legal status: Active
Application number
CN202110326833.7A
Other languages
Chinese (zh)
Other versions
CN113034596A
Inventor
Guofeng Zhang
Hujun Bao
Ye Zhang
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110326833.7A
Publication of CN113034596A
Application granted
Publication of CN113034596B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional object detection and tracking method, belonging to the field of three-dimensional vision. The method comprises a scanning stage and a detection-and-tracking stage. In the scanning stage, the three-dimensional object to be detected and tracked is scanned, and information about the object and its environment is obtained in combination with a visual odometer. In the detection-and-tracking stage, an improved SuperPoint algorithm performs feature extraction and matching, and an accurate, robust detection and tracking result is obtained in combination with the visual odometer. The improved SuperPoint algorithm modifies the loss function so that training consumes far less memory while precision is preserved, and the descriptor dimensionality of the network output is adjusted adaptively, yielding higher precision in the detection and tracking stages. The resulting system is accurate and robust, and produces correct results even when the object is partially occluded, the camera moves quickly, or similar objects are present in the environment.

Description

Three-dimensional object detection and tracking method
Technical Field
The invention relates to the field of three-dimensional vision, in particular to a three-dimensional object detection and tracking method based on feature matching.
Background
The goal of the three-dimensional object detection and tracking task is to detect a specified three-dimensional object in consecutive video frames and solve its six-degree-of-freedom pose, which comprises a three-degree-of-freedom orientation and a three-degree-of-freedom position. The accurate poses obtained by detection and tracking are important in many fields such as autonomous driving, robot control and augmented reality. For example, by estimating the actual pose of an object, an autonomous vehicle or robot can accurately predict and plan its own behavior and path, avoiding collisions and violations.
Academic research on three-dimensional object detection and tracking has continued for over ten years. Methods for detecting three-dimensional objects are roughly classified into template-matching-based methods, feature-learning-based methods and deep-learning-based methods.
Owing to the strong representational and scene-understanding capabilities of convolutional neural networks, more and more neural-network-based three-dimensional object detection methods have appeared in recent years. Some deep-learning methods use convolutional neural networks to detect and identify objects of general classes, such as the classical Faster R-CNN, YOLO and SSD algorithms. Trained on large data sets, these networks have powerful classification and detection capabilities and are often used as backbone networks. Beyond detecting the class of objects in an image, an estimate of the pose of the corresponding three-dimensional object is also needed. Such methods fall into two main categories. The first directly regresses the pose of the three-dimensional object with a neural network; the second first uses the network to locate the key points of the three-dimensional object (usually the eight vertices of its bounding box and the bounding-box center) on the two-dimensional image, so the network output is only an intermediate result of the task. Once the 2D-3D correspondences are obtained, the pose of the three-dimensional object is computed with a PnP algorithm. Compared with directly regressing the pose with the network, this yields higher precision. Still other methods use a segmentation network to identify image regions containing the three-dimensional object and regress key-point locations within those regions, which also gives good results. Such two-stage methods nevertheless have drawbacks: when a three-dimensional object is occluded or truncated over a large area, its key points are hidden; although the neural network can predict the positions of invisible key points by memorizing similar patterns, it cannot give accurate predictions, so the subsequent pose solving fails.
Tracking methods for three-dimensional objects generally include feature-point-based, model-based, neural-network-based and SLAM-based methods. Feature-point-based methods suit richly textured objects, and most such algorithms rely on high-performance feature extraction, description and matching. Tracking a textureless three-dimensional object requires a model of the object in advance, after which the object is tracked using edge information or the texture and color information of the image. As SLAM technology has matured, SLAM algorithms can complete the tracking task well in different environments: a SLAM algorithm estimates its own pose in an unknown environment and reconstructs a three-dimensional map of that environment; given the sensor data, it solves a state-estimation problem to recover the camera pose and reconstruct the environment. SLAM-based methods register the point cloud of the three-dimensional object in the SLAM coordinate system, use the camera pose output by the SLAM system, and verify and update the object pose with a suitable strategy, so that the object is tracked accurately and stably.
The currently mainstream three-dimensional object detection and tracking methods have the following strengths and weaknesses. Feature-point-based methods are effective for richly textured objects, but when the environment is too complex and blends with the three-dimensional object, many feature points are mismatched, causing detection or tracking to fail. Edge- or region-based algorithms are friendlier to textureless objects but work poorly when the object is partially occluded or the illumination changes strongly. Deep-learning-based detection and identification achieves high accuracy in environments similar to the training data set but generalizes poorly. Moreover, deep-learning methods require separate training for each class of objects, or even each individual object, and the training parameters must be fine-tuned for the specific three-dimensional objects, consuming a large amount of time and cost. In addition, owing to data-set limitations, the objects that can be detected and tracked are restricted to the designated three-dimensional objects provided by training data sets such as LINEMOD and YCB. Deep-learning-based detection and tracking algorithms therefore cannot handle the three-dimensional objects of daily life and lack universality and practicality.
Most of the above methods require a scanner or cumbersome advance modeling of the three-dimensional object, which limits their use in mass-market augmented reality applications: a popular AR application faces a large number of everyday three-dimensional objects of different shapes and textures, which are difficult to model one by one in advance. At present only a few commercial products, such as ARCore, ARKit and Vuforia, achieve general detection and tracking of a wide range of everyday three-dimensional objects. These products complete the scan with a very simple procedure, such as having the user hold a consumer-grade phone and move it around the object. Apart from these few products, no open-source solution achieves a similar effect. Against this background, the invention designs a detection and tracking scheme for everyday three-dimensional objects and combines it with the SuperPoint method to obtain an efficient and robust three-dimensional object detection and tracking system.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a three-dimensional object detection and tracking method. A scanning module combined with a visual odometer obtains the SuperPoint point cloud and related information of the three-dimensional object to be detected and tracked; during detection and tracking, the robust feature detection and matching results of the SuperPoint algorithm, combined with the visual odometer, allow common everyday three-dimensional objects to be tracked efficiently and robustly.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a three-dimensional object detection and tracking method, which comprises the following steps:
step 1: scanning phase
Run a visual odometer to obtain accurate keyframe poses and inter-frame connection information; extract SuperPoint features from the keyframes selected by the visual odometer and compute their descriptors; construct the SuperPoint point cloud of the keyframes in the scanning process using an SfM algorithm, and save the SuperPoint point cloud, the SuperPoint feature points of the keyframes and the connection relations between keyframes to a file;
step 2: detection phase
Using the three-dimensional object information obtained in step 1 together with the information saved during the scanning process, perform 2D-2D and 2D-3D SuperPoint feature matching between the current frame to be detected and the three-dimensional object, and solve PnP to obtain the pose of the object in the frame to be detected.
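The PnP solve in this step can be sketched briefly; the following is a minimal illustration assuming OpenCV's RANSAC-based solver, where the names (pts_3d, pts_2d, K) and the threshold values are illustrative assumptions, not identifiers from the patent.

```python
# Minimal sketch of the 2D-3D PnP pose solve, assuming OpenCV.
import numpy as np
import cv2

def solve_object_pose(pts_3d, pts_2d, K):
    """pts_3d: (N,3) SuperPoint cloud points matched 2D-3D to the frame.
    pts_2d: (N,2) pixel locations in the frame to be detected.
    K:      (3,3) camera intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float32), pts_2d.astype(np.float32), K, None,
        reprojectionError=3.0,            # pixel threshold for inliers
        iterationsCount=100,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok or inliers is None:
        return None                       # detection failed for this frame
    R, _ = cv2.Rodrigues(rvec)            # axis-angle -> rotation matrix
    return R, tvec, len(inliers)          # pose plus PnP inlier count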
step 3: Tracking phase
Obtain the pose of the object in the frame to be tracked using optical-flow matching and reprojection matching between two adjacent tracked frames; tracking of non-keyframes is completed by the visual odometer.
The invention extracts the SuperPoint characteristics and adopts an improved SuperPoint algorithm, wherein the improvement on the SuperPoint algorithm comprises the following steps:
1) on the basis of the SuperPoint algorithm, replace the dense loss function with a sparse one to improve the loss function of the SuperPoint algorithm; specifically, N positive descriptor pairs are sampled at random, and then M negative descriptor pairs are sampled for each positive pair, finally yielding M × N positive and negative descriptor pairs;
2) modify the descriptor dimensionality of the SuperPoint network output, so that the network outputs 64-dimensional descriptors.
In step 1, the SuperPoint point cloud of the key frame in the scanning process is constructed by utilizing the SfM algorithm, and the method specifically comprises the following steps:
after the scanning process ends, all keyframes selected by the SLAM are available; extract SuperPoint feature points for each keyframe and compute descriptors;
according to the covisibility relations between keyframes obtained by the SLAM system, select for each frame the t keyframes with the best covisibility and triangulate the SuperPoint matches pairwise;
since the keyframe poses obtained by the visual odometer during scanning are sufficiently accurate, the keyframe poses are fixed, and bundle adjustment is used to optimize the reprojection error and obtain optimal position estimates of the map points.
Further, step 2 is as follows:
for each frame to be detected, bag-of-words matching screens out the two keyframes from the scanning process most similar to the current frame to be detected; a 2D-2D matching relation is obtained by screening with a KNN matching algorithm and a RANSAC algorithm; the 2D-3D matching is the correspondence between the feature points of the current frame to be detected and the three-dimensional point cloud; combining the 2D and 3D information yields a more accurate pose.
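A minimal sketch of this screening, assuming OpenCV: KNN matching with a ratio test, followed by RANSAC on the fundamental matrix to drop outliers. The names and thresholds are illustrative; the L2 matcher reflects the fact that SuperPoint descriptors are float vectors.

```python
# Sketch of the 2D-2D screening: KNN ratio test, then RANSAC.
import numpy as np
import cv2

def screened_matches(desc_q, desc_t, pts_q, pts_t, ratio=0.8):
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc_q, desc_t, k=2)
    good = [m[0] for m in knn
            if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    if len(good) < 8:                       # 8 points needed for F-matrix
        return []
    src = np.float32([pts_q[m.queryIdx] for m in good])
    dst = np.float32([pts_t[m.trainIdx] for m in good])
    F, mask = cv2.findFundamentalMat(src, dst, cv2.FM_RANSAC, 3.0, 0.99)
    if mask is None:
        return []
    return [m for m, keep in zip(good, mask.ravel()) if keep]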
Compared with the prior art, the invention has the advantages that:
the invention builds a detection and tracking system for common three-dimensional objects in daily life on the basis of visual mileage.
Because the feature-point and descriptor matching results of traditional methods are not robust under special conditions, the improved SuperPoint feature extraction network replaces the traditional ORB feature detector to obtain more robust feature matches.
In special scenes, for example when the object is partially occluded, the camera moves quickly, or other objects similar to the object to be tracked exist in the scene, the invention can still obtain good results.
Drawings
FIG. 1 is a scanning flow diagram of the present invention;
FIG. 2 is a detection and tracking flow diagram of the present invention;
FIG. 3 is a schematic diagram of the detection module obtaining a related three-dimensional point cloud using the connection relationship obtained by the scanning process;
FIG. 4 shows three-dimensional objects with rich texture tested by the present invention;
FIG. 5 is a partial qualitative comparison of the present invention on a general test set;
FIG. 6 is a partial qualitative comparison of the present invention on a complex data set; wherein 6-1 shows the case where the object is partially occluded, and 6-2 shows the case where the camera moves quickly; 6-3 illustrates the case where the object to be tracked is placed in an environment different from the scanning stage; 6-4 illustrates the situation where there are other objects in the environment that are similar to the object to be tracked.
FIG. 7 is a qualitative comparison of the present invention with current mature related products; wherein 7-1 is the comparison result in a general scene, and 7-2 is the partial comparison result in a complex scene.
Detailed Description
The invention is described in detail below with reference to the drawings. The technical features of the embodiments of the present invention can be combined correspondingly without mutual conflict.
The invention discloses a three-dimensional object detection and tracking method. An accurate and robust three-dimensional object detection and tracking scheme is obtained by using an improved SuperPoint algorithm in combination with a visual odometer. The system comprises two stages: a scanning stage and a detection-and-tracking stage. The three-dimensional object is simply and fully scanned in the scanning stage, after which its detection and tracking are completed in the detection-and-tracking stage.
The scanning process of three-dimensional object detection and tracking is shown in fig. 1, and mainly includes the following steps:
step 1.1: The user scans the plane on which the three-dimensional object rests; the plane in the current scene is obtained from the point cloud produced by the ORB-SLAM3 system together with a plane-detection algorithm; the user selects the plane on which the object rests as the candidate plane, and an initial three-dimensional bounding box is placed on that plane.
Step 1.2: Modify the size and position of the initial bounding box with the keyboard and mouse so that the three-dimensional bounding box just encloses the three-dimensional object to be detected and tracked.
Step 1.3: After the three-dimensional bounding box is obtained, fully scan its five surfaces (all except the bottom), computing the keyframe poses during scanning with the tracking and mapping functions of the ORB-SLAM3 system.
Step 1.4: Extract SuperPoint key points from the keyframes obtained by the SLAM system and, according to the covisibility relations between keyframes, select for each frame the t keyframes with the best covisibility and triangulate matches pairwise.
Step 1.5: Because the keyframe pose parameters obtained by the SLAM system are accurate enough, fix the keyframe poses and optimize only the three-dimensional coordinates of the map points. Finally, save the SuperPoint point cloud of the three-dimensional object, the SuperPoint feature points of the keyframes and the keyframe connection relations to a file.
The improvement of the SuperPoint algorithm of the invention specifically comprises the following steps:
1) the loss function of the SuperPoint algorithm is improved, reducing the memory required during training while preserving precision:
For a pair of images, the loss function of the original SuperPoint algorithm uses all possible descriptor pairs from the two descriptor sets. Denoting the height and width of the network output by Hc and Wc, (Hc × Wc)² positive and negative descriptor pairs must be generated. When training the network with this loss, we found that a very large amount of memory is required: a PC with an 8 GB GPU cannot complete the training. We therefore replace the dense loss with a sparse one. Specifically, we first sample N positive descriptor pairs at random, then sample M negative descriptor pairs for each positive pair, finally obtaining M × N positive and negative descriptor pairs. Experiments show that, compared with the dense loss function, training with the sparse loss reaches similar precision without occupying excessive memory, so the network can be trained successfully.
2) adaptively modify the descriptor dimensionality of the SuperPoint network output:
The original SuperPoint network outputs 256-dimensional descriptors. When these are applied to the detection and tracking system, the descriptor length proves too large: the information file saved in the scanning stage grows noticeably, reading that file in the detection and tracking stage occupies more memory, and the time spent on 2D-3D and 2D-2D matching in the detection stage increases significantly. To solve this, the SuperPoint network was trained to output descriptors of different dimensions, and the express delivery box was used to test the influence of descriptor length on the detection and tracking system over three groups of conventional data sequences. The tested descriptor lengths were 256, 128, 64 and 32 dimensions; Table 2 lists the scanning results, detection-module precision, tracking-module precision and timing for each. As descriptor length decreases, the three-dimensional object information file produced by the scanning stage shrinks, and the subsequent detection and tracking stages occupy less memory. Because 2D-3D matching uses the connection relations, its timing changes little, but 2D-2D matching becomes faster as descriptors shorten, so the detection frequency rises and the detection effect improves. When the descriptor is too short, however, its distinctiveness drops, the tracking module obtains fewer inliers and the success rate falls. Weighing system precision and efficiency against the effect of descriptor length on feature matching, 64-dimensional descriptors are finally used for subsequent scanning, detection and tracking.
The scanning stage SfM algorithm specifically comprises the following steps:
a) after the scanning process ends, all keyframes selected by the SLAM are available; extract SuperPoint feature points for each keyframe and compute descriptors;
b) according to the covisibility relations between keyframes obtained by the SLAM system, select for each frame the t keyframes with the best covisibility and triangulate the SuperPoint matches pairwise;
c) the keyframe poses obtained by the visual odometer during scanning are sufficiently accurate, so the keyframe poses are fixed and bundle adjustment is used to optimize the reprojection error, yielding optimal position estimates of the map points.
The detection and tracking process of three-dimensional object detection and tracking is shown in fig. 2. The detection part comprises the following steps:
step 2.1: Extract SuperPoint feature points from the current frame to be detected.
Step 2.2: Using bag-of-words matching, obtain two candidate keyframes saved during the scanning process that are similar to the current frame to be detected. Using the keyframe connection relations saved during scanning, obtain the map points associated with these candidate keyframes and use them as the candidate map points for the detection module's 2D-3D matching.
Step 2.3: Perform 2D-2D matching between the feature points of the two candidate keyframes from step 2.2 and the current frame to be detected, perform 3D-2D matching between the candidate map points from step 2.2 and the current frame, and solve for the pose of the current frame by combining the 2D-2D and 3D-2D information.
In one embodiment of the present invention, the implementation of step 2.2 is described:
the time consumed by the detection module is mainly in KNN violence matching. Due to the large number of point clouds, violent matching takes much time. The time consumed by matching is reduced, the detection frequency can be improved, and the higher the detection frequency is, the more accurate pose prior of the tracking module is possible. And the SLAM can output more accurate point cloud and camera pose through sufficient optimization due to the reduction of the calculation complexity. For a frame of image to be detected, map points associated with the frame of image are only a small part of the whole three-dimensional object point cloud. Since the line of sight is obscured by the three-dimensional object itself, at least one quarter of the three-dimensional object will not appear in the image. Therefore, the map point with the visible view range of the current frame can be found by fully utilizing the connection information of the key frame and the map point stored in the scanning process, the violent matching of all three-dimensional point clouds is avoided, and only a small part of related point clouds are subjected to violent matching by utilizing the connection relation. Therefore, the time consumption of the detection module is reduced, and the system performance is improved.
Specifically, given a current frame to be detected, bag-of-words matching retrieves the two keyframes KF1 and KF2 saved during scanning that are most similar to the current frame. Through the connection relations built by the SLAM system during scanning, the n keyframes KF3 to KFn+2 with the best covisibility with these two keyframes are obtained; the map points that must take part in brute-force matching are then those visible from these n+2 keyframes. Fig. 3 illustrates distinguishing map points by the connection relations: the view cones are the n+2 keyframes obtained by this procedure, the darker points are the point cloud of the three-dimensional object, and the lighter points are the map points associated with the n+2 keyframes. Screening the point cloud with the connection relations eliminates a large number of map points invisible from the current viewpoint and also reduces the probability of brute-force mismatches, thereby cutting the detection module's time.
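A hedged sketch of this candidate selection follows; the data-structure names are assumptions, and the ordering of covisible neighbours is taken to be precomputed by covisibility strength.

```python
# Sketch: expand the two bag-of-words hits with their best covisible
# keyframes and take the union of the map points those frames observe.
def candidate_map_points(kf1, kf2, covisible, observed, n=8):
    """covisible: dict kf_id -> kf_ids sorted by covisibility strength.
    observed:  dict kf_id -> set of map-point ids seen by that keyframe."""
    neighbours = [kf for kf in covisible[kf1] + covisible[kf2]
                  if kf not in (kf1, kf2)][:n]   # n best covisible frames
    frames = {kf1, kf2, *neighbours}             # the n+2 keyframes
    points = set()
    for kf in frames:                            # union over the n+2 frames
        points |= observed[kf]
    return points                                # only these are matched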
The tracking part of the detection and tracking process comprises the following steps:
step 3.1: Extract SuperPoint feature points from the current frame to be tracked.
Step 3.2: Re-project map points into the current frame using the pose obtained from the most recent detection frame to get 3D-2D feature matches; meanwhile, match the previous tracked frame against the current tracked frame with optical flow to obtain further 3D-2D matches, and solve PnP to obtain the pose of the frame to be tracked.
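A minimal sketch of the optical-flow branch of this step, assuming OpenCV's pyramidal Lucas-Kanade tracker; prev_uv, prev_xyz and the parameter values are illustrative assumptions.

```python
# Sketch: propagate known 3D points' 2D locations by optical flow,
# then solve PnP on the surviving 3D-2D pairs.
import numpy as np
import cv2

def track_pose(prev_gray, cur_gray, prev_uv, prev_xyz, K):
    cur_uv, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_gray,
        prev_uv.reshape(-1, 1, 2).astype(np.float32), None,
        winSize=(21, 21), maxLevel=3)
    keep = status.ravel() == 1                 # points the flow followed
    if keep.sum() < 6:
        return None                            # too few matches for PnP
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        prev_xyz[keep].astype(np.float32),
        cur_uv.reshape(-1, 2)[keep], K, None)
    return (rvec, tvec, inliers) if ok else None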
Step 3.3: for the tracking of non-key frames, visual odometer-assisted tracking is utilized.
The pose decision process of three-dimensional object detection and tracking is as follows:
The pose decision module keeps one state each for the detection module and the tracking module, with a counter mechanism: only when a module fails to solve the pose for a number of consecutive frames reaching a threshold is its state set to detection failure or tracking failure. This avoids the discontinuity in detection or tracking that a fixed threshold would cause when, at viewing angles with few feature points, a single frame's PnP yields fewer inliers than the threshold.
Examples
To further demonstrate the effect of the invention, this embodiment records detection and tracking data sets with three richly textured everyday objects, shown in fig. 4. The data sets comprise 3 groups recorded under normal conditions and 3 groups recorded under challenging conditions: the object partially occluded, the camera moving quickly, and similar objects present in the environment.
Evaluation metrics:
For each group of detection and tracking data, this embodiment counts, for the reference scheme, the number of keyframes in each sequence, the number of inliers after the detection-frame PnP, the number of successfully detected frames, the number of inliers after the tracking-frame PnP, the number of successfully tracked frames, the number of successful system frames, and the system success rate. A frame whose detection-module PnP inlier count exceeds a threshold is recorded as a successful detection frame; a frame whose tracking-module PnP inlier count exceeds the threshold is recorded as a successful tracking frame; a frame marked successful by the pose decision module is recorded as a successful system frame; and the system success rate is the number of successful system frames divided by the total number of keyframes.
Experiment 1: Quantitative results of the improved SuperPoint algorithm
In this embodiment, descriptors of length 256, 128, 64 and 32 dimensions were obtained by training the SuperPoint network. Results on the HPatches data set are shown in Table 1, with bold marking the better result. As the table shows, replacing the dense loss function with the sparse loss greatly reduces the memory required for network training while maintaining network performance. It can also be seen that both descriptor evaluation metrics decrease slightly as the dimensionality of the output descriptors decreases. Combining this with the performance comparison of SuperPoint within the detection and tracking system (Table 2), the SuperPoint network is set to output 64-dimensional descriptors.
TABLE 1 comparison of SuperPoint Pre-training models on HPatches datasets with model results obtained by the present invention
[Table 1 is reproduced as an image in the original publication; its data are not available in this text.]
TABLE 2 influence of the lengths of the different descriptors on the speed and accuracy of the detection tracking module (time unit: ms)
[Table 2 is reproduced as an image in the original publication; its data are not available in this text.]
Experiment 2: Performance in general detection and tracking scenarios
A large number of experiments test the performance of the invention in general detection and tracking scenarios. The quantitative results are shown in Table 3.
Table 3 statistics of the results of the present invention in detection and tracking
[Table 3 is reproduced as an image in the original publication; its data are not available in this text.]
As Table 3 shows, the SuperPoint-based scheme obtains a large number of inliers for detection and tracking: SuperPoint features give better matches and better assist the detection and tracking process. Across the data sequences of these three objects, the overall success rate is mostly above 90%. Fig. 5 shows qualitative results; the points in the figure are the three-dimensional point cloud obtained during scanning, projected with the pose output by the system. Qualitatively, the invention outputs correct pose results while maintaining a high system success rate.
Experiment 3: robustness testing of the invention
In this embodiment, a large number of experiments test the robustness of the invention under extreme conditions, first severe camera motion and heavy object occlusion; the quantitative results are shown in Table 4. The system maintains a high success rate in both extreme cases. Partial qualitative results appear in Figs. 6-1 and 6-2: 6-1 shows the invention with the object partially occluded, and 6-2 shows it under fast camera motion. Together with the quantitative results, this shows the method is highly robust and obtains correct poses under these extreme conditions: it performs well when the object is occluded over a large area, and it also gives good results under severe camera motion, provided the SLAM system maintains stable tracking.
Table 4 robustness result statistics in complex scenarios of the present invention
[Table 4 is reproduced as an image in the original publication; its data are not available in this text.]
In addition, this embodiment moves and rotates the object slightly during detection and tracking, or places the object in an environment different from the scanning stage. Taking the express delivery box as an example, qualitative results are given in fig. 6-3: the method is robust and obtains correct poses when the object is moved or placed in a completely different scene. This embodiment also analyses the complex case in which an object similar to the object to be tracked exists in the environment; again taking the express delivery box as an example, the qualitative results are shown in fig. 6-4. Because the algorithm matches features by descriptor, the invention still obtains correct pose output.
Experiment 4: qualitative comparison result of the invention and commercial product
The commercial schemes closest to the detection and tracking scenario discussed here are Vuforia and ARKit. Both provide users with a scanning procedure, after which stable object identification and tracking can be performed with the model file generated by the scan; users realize object detection and tracking by operating the software directly. This detection and tracking process requires no user-supplied 3D model, matching the needs of augmented reality applications and the user experience. Unlike ARKit, however, Vuforia needs a planar marker to assist the scanning process.
Since the code of ARKit and similar products is not public, we can only compare the system presented here qualitatively with the closest product, ARKit. Qualitative comparison experiments were performed by installing the ARKit SDK on an iPhone 11 and running it in the same scenes as the present system.
ARKit completes object detection and tracking with the monocular camera of a phone, so for this embodiment the SLAM system of the invention was switched to monocular, purely visual mode, and test data were recorded with the rear camera of a Xiaomi Mi 9. Because of the scale ambiguity of monocular purely visual SLAM, the scales of the scanning process and of the detection and tracking process are aligned in the initial stage of detection and tracking. Partial qualitative comparisons with the ARKit scheme are shown in fig. 7: fig. 7-1 shows the general case, and fig. 7-2 the special cases in which the tracked object is occluded or a similar object exists in the environment. Odd rows in the figure are the ARKit results on the phone, and even rows are the results of the present system. Since the camera parameters (FOV) of the iPhone and the Mi 9 differ, images at the same viewpoint differ slightly, but this does not affect the qualitative comparison. Compared with this mature foreign product for three-dimensional object detection and tracking, the invention shows a similar degree of completeness and similar performance.
The following conclusions can be drawn from comparative experiments:
1) the detection and tracking system obtained by the invention runs efficiently and robustly in general detection and tracking scenes.
2) In special scenes, for example when the object is partially occluded, the camera moves quickly, or other objects similar to the object to be tracked exist in the scene, the invention also obtains good results.
3) Compared with mature commercial products with similar use scenarios, the invention achieves similar effects.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (7)

1. A three-dimensional object detection and tracking method is characterized by comprising the following steps:
step 1: scanning phase
Operating a visual odometer to obtain accurate pose of the key frame and connection relation information between frames; extracting SuperPoint characteristics from a key frame set by the visual odometer, and calculating descriptor information; constructing a SuperPoint point cloud of a key frame in a scanning process by utilizing an SfM algorithm, and storing SuperPoint point cloud information, SuperPoint characteristic point information of the key frame and connection relation information between key frames into a file;
step 2: detection phase
Performing 2D-2D and 2D-3D SuperPoint characteristic matching on the current frame to be detected and the three-dimensional object by using the three-dimensional object information obtained in the step 1 and combining information stored in the scanning process, and solving by PnP to obtain the pose of the object in the frame to be detected;
and step 3: tracking phase
Obtaining the pose of an object in the frame to be tracked by utilizing optical flow matching and reprojection matching between two adjacent key frames to be tracked; tracking of non-key frames is accomplished using a visual odometer.
2. The method for detecting and tracking the three-dimensional object according to claim 1, wherein an improved SuperPoint algorithm is adopted for extracting the SuperPoint characteristics in the step 1, wherein the improvement of the SuperPoint algorithm comprises:
1) improvement of loss function
On the basis of the SuperPoint algorithm, a sparse loss function replaces the dense loss function to improve the loss function of the SuperPoint algorithm; for a pair of homography-transformed images input to the network, the descriptor branch outputs a pair of descriptor maps of dimension Hc × Wc × 256, and the loss function of the original SuperPoint algorithm must be applied to all possible descriptor pairs in the two descriptor sets, i.e. (Hc × Wc)² positive and negative descriptor pairs must be generated; the improved loss function randomly samples N positive descriptor pairs from the two descriptor sets, then samples M negative descriptor pairs for each positive pair, finally obtaining M × N positive and negative descriptor pairs;
2) improving descriptor dimensionality of output
And adaptively modifying the dimensionality of the descriptor output by the SuperPoint network, wherein the SuperPoint network outputs a 64-dimensional descriptor.
3. The method for detecting and tracking the three-dimensional object according to claim 1, wherein the step 1 of constructing the SuperPoint point cloud of the key frame in the scanning process by using the SfM algorithm specifically comprises the following steps:
after the scanning process is finished, all key frames screened out by the SLAM are obtained; extracting SuperPoint characteristic points for each key frame and calculating a descriptor;
selecting t key frames with the best common-view relation for each frame according to the common-view relation among the key frames obtained by the SLAM system to carry out pairwise triangularization operation;
the poses of the keyframes obtained by the visual odometer during the scanning process are sufficiently accurate, so the keyframe poses are fixed and bundle adjustment is used to optimize the reprojection error, obtaining optimal position estimates of the map points.
4. The method for detecting and tracking a three-dimensional object according to claim 1, wherein the step 1 specifically comprises:
1.1: scanning a plane where a three-dimensional object is located, obtaining the plane in a current scene by utilizing a point cloud obtained by an ORB-SLAM3 system and a plane detection algorithm, selecting the plane where the three-dimensional object is located as a candidate plane, and placing an initial three-dimensional bounding box on the plane;
1.2: modifying the size and the position of the initial bounding box so that the three-dimensional bounding box just encloses the subsequent three-dimensional object to be detected and tracked;
1.3: after the three-dimensional bounding box is obtained, fully scanning its five surfaces other than the bottom surface, and calculating the keyframe poses during the scanning process by using the tracking and mapping functions of an ORB-SLAM3 system;
1.4: extracting SuperPoint key points from the key frames obtained by the SLAM system, and selecting t key frames with the best common view relation for each frame to carry out pairwise triangularization operation according to the common view relation between the key frames obtained by the SLAM system;
1.5: because the key frame pose parameters obtained by the SLAM system are accurate enough, the pose of the key frame is fixed, only the three-dimensional coordinates of the map points are optimized, and finally the SuperPoint point cloud information of the three-dimensional object, the SuperPoint characteristic point information of the key frame and the connection relation information of the key frame are stored in a file.
5. The method for detecting and tracking a three-dimensional object according to claim 1, wherein the step 2 is:
for each frame to be detected, bag-of-words matching screens out the two keyframes from the scanning process most similar to the current frame to be detected; a 2D-2D matching relation is obtained by screening with a KNN matching algorithm and a RANSAC algorithm; the 2D-3D matching is the correspondence between the feature points of the current frame to be detected and the three-dimensional point cloud; and a more accurate pose is obtained by combining the 2D and 3D information.
6. The method for detecting and tracking a three-dimensional object according to claim 1, wherein the step 2 specifically comprises the steps of:
2.1: extracting SuperPoint characteristic points for a current frame to be detected;
2.2: obtaining two candidate key frames which are stored in the scanning process and are similar to the current frame to be detected by utilizing bag-of-words matching, obtaining map points associated with the candidate key frames by utilizing the connection relation of the key frames stored in the scanning process, and using the map points as the candidate map points matched by the detection module in the 2D-3D mode;
2.3: 2D-2D matching is carried out on the feature points in the two candidate key frames in the step 2.2 and the current key frame to be detected, 3D-2D matching is carried out on the candidate map points obtained in the step 2.2 and the current frame to be detected, and the pose of the current frame to be detected is obtained by combining the 2D and 3D information; specifically, by using RANSAC, a rotation matrix and a translation matrix are obtained by using 6 pairs of 2D-2D matching, and then, scale information is restored by using a pair of 2D-3D matching.
7. The method for detecting and tracking a three-dimensional object according to claim 1, wherein the step 3 specifically comprises the steps of:
3.1: extracting SuperPoint characteristic points from a current frame to be tracked;
3.2: using the pose information obtained from the latest detection frame to re-project into the current frame and obtain 3D-2D feature point matches; meanwhile matching the previous tracked keyframe with the current frame to be tracked by optical flow to obtain 3D-2D matches, and solving PnP to obtain the pose of the frame to be tracked;
3.3: for the tracking of non-key frames, visual odometer-assisted tracking is utilized.
CN202110326833.7A · Priority 2021-03-26 · Filed 2021-03-26 · Three-dimensional object detection and tracking method · Active · Granted as CN113034596B

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110326833.7A · 2021-03-26 · 2021-03-26 · Three-dimensional object detection and tracking method


Publications (2)

Publication Number Publication Date
CN113034596A CN113034596A (en) 2021-06-25
CN113034596B true CN113034596B (en) 2022-05-13

Family

ID=76474260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110326833.7A · Three-dimensional object detection and tracking method · 2021-03-26 · 2021-03-26 · Active (granted as CN113034596B)

Country Status (1)

Country Link
CN · CN113034596B


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020259248A1 (en) * 2019-06-28 2020-12-30 Oppo广东移动通信有限公司 Depth information-based pose determination method and device, medium, and electronic apparatus
GB202019121D0 (en) * 2019-08-23 2021-01-20 Shang Hai Yiwo Information Tech Co Ltd No title
CN110827302A (en) * 2019-11-14 2020-02-21 中南大学 Point cloud target extraction method and device based on depth map convolutional network
CN111768498A (en) * 2020-07-09 2020-10-13 中国科学院自动化研究所 Visual positioning method and system based on dense semantic three-dimensional map and mixed features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Monocular visual odometry based on a deep-learning feature-point method; Xiong Wei et al.; Computer Engineering & Science; 2020-01-15 (No. 01); full text *

Also Published As

Publication number Publication date
CN113034596A (en) 2021-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant