CN113052311B - Feature extraction network with layer jump structure and method for generating features and descriptors - Google Patents

Feature extraction network with layer jump structure and method for generating features and descriptors

Info

Publication number
CN113052311B
CN113052311B CN202110281763.8A CN202110281763A CN113052311B CN 113052311 B CN113052311 B CN 113052311B CN 202110281763 A CN202110281763 A CN 202110281763A CN 113052311 B CN113052311 B CN 113052311B
Authority
CN
China
Prior art keywords
feature
image
layer
score
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110281763.8A
Other languages
Chinese (zh)
Other versions
CN113052311A (en)
Inventor
杨宁
韩云龙
郭雷
方俊
钟卫军
徐安林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Northwestern Polytechnical University
China Xian Satellite Control Center
Original Assignee
BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Northwestern Polytechnical University
China Xian Satellite Control Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY, Northwestern Polytechnical University, China Xian Satellite Control Center filed Critical BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Priority to CN202110281763.8A priority Critical patent/CN113052311B/en
Publication of CN113052311A publication Critical patent/CN113052311A/en
Application granted granted Critical
Publication of CN113052311B publication Critical patent/CN113052311B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a feature extraction network with a layer jump structure and a method for generating features and descriptors. The network is an image feature extraction network with a layer jump structure that fuses the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of VGG16 to retain detail information of all levels and effectively improve feature point positioning precision. The uniqueness index of a feature point refers to the degree of similarity between a local area of an image and the other areas of the same image. The uniqueness of each location in the image, i.e., its degree of similarity to all other locations, is measured by a uniqueness score, and the matching performance of the network is improved by selecting sufficiently unique feature points in the image. The invention achieves leading performance on the HPatches image matching dataset, particularly on its illumination sequences.

Description

Feature extraction network with layer jump structure and method for generating features and descriptors
Technical Field
The invention belongs to the field of image processing and computer vision, and relates to a feature extraction network with a layer jump structure and a method for detecting image feature points and generating descriptors by using the network.
Background
In many applications, such as visual localization, object detection, pose estimation and three-dimensional reconstruction, extracting feature points and descriptors of an image is crucial. In these tasks, both high feature matching accuracy and high feature point detection precision are desirable. High feature matching accuracy means that, for two pictures to be matched, as few false matches as possible are produced during feature point matching; high feature point detection precision means that, for successfully matched feature point pairs, the positions they point to in the two pictures correspond to the same location of the scene.
The classical feature point detection method is implemented in two stages: keypoints are first detected, and a local descriptor is then computed for each keypoint. FAST was the first approach to keypoint detection by means of machine learning. The SIFT method integrates the whole pipeline of image feature point detection and local feature description, and is the typical detect-then-describe pattern. LIFT, proposed by Yi et al., was the first to complete image feature point detection, local feature description and matching entirely with convolutional neural networks, integrating a CNN-based feature point detection network, a local feature orientation network and a local feature description network. Its drawbacks are the high overall complexity of the network and the fact that, during training, LIFT still uses screened SIFT feature points as feature point labels, so the limitations of the SIFT method cannot be removed. LF-Net, proposed by Ono et al., extends the idea of LIFT as a whole but greatly simplifies the network structure: instead of treating each step of feature point detection and local feature description as an independent module, the network is designed as a whole. During training, the feature point detector is trained with an unsupervised method and the whole pipeline is trained end to end, which improves network performance and reduces network complexity.
The describe-and-detect methods that have emerged in recent years generally exhibit better performance than the earlier detect-then-describe methods, because they use the same network for both the detection and the description task, with most parameters shared between the two, thus reducing network complexity. SuperPoint, proposed by DeTone et al., uses a VGG-style network to extract image features and detects feature point coordinates after convolution with a scheme similar to image super-resolution. SuperPoint's labels are produced by a feature point detector trained on a synthetic dataset, eliminating the bias of manual labeling. The D2-Net network proposed by Dusmanu et al. describes first and then detects: it uses a VGG-16 network as the backbone, with a feature point detector attached after the output feature map of VGG-16. D2-Net is characterized by a feature point detector without learned parameters, detecting feature points with a fixed algorithm. Despite its simple structure, D2-Net achieved performance on a par with SuperPoint when it appeared, proving the feasibility of the idea. The training data used by the R2D2 network proposed by Revaud et al. is also free of manual annotation error: optical flow is adopted instead of the MegaDepth dataset to generate point correspondences, providing a new idea for training data. R2D2 also proposes a descriptor reliability index to eliminate mismatches.
However, many methods that jointly learn local feature points and descriptors still have two major limitations: 1. feature point positioning accuracy is low, so camera geometry problems cannot be solved effectively; 2. many works design the keypoint detector with repeatability as the only criterion, which may cause mismatches in regions of similar texture.
The accuracy of keypoint localization has a great impact on the performance of many computer vision tasks; for example, D2-Net produces large projection errors in SfM. The low accuracy of keypoint localization is mainly due to keypoint detection being performed on low-resolution feature maps (e.g., D2-Net detects on a map at 1/4 of the original image resolution). To ensure better feature point precision, SuperPoint upsamples the low-resolution feature map obtained by its VGG-like network back to the original resolution and then performs feature point detection with pixel-level supervision, while R2D2 replaces pooling layers with dilated convolutions to keep the feature map resolution unchanged, which greatly increases computation. ASLFeat improves on D2-Net by upsampling and fusing the feature point scores obtained at different resolutions, so as to obtain all feature points while maintaining their spatial precision. Although ASLFeat can address feature point positioning precision with little extra computation and obtain feature information of different levels, it only fuses score maps of different resolutions and therefore captures only a small amount of information from the different levels.
Many images contain large regions of prominent but repetitive texture, such as leaves in nature, the windows of skyscrapers, or sea waves. For methods based on local gradient histograms, although such regions contain many positions with large gradients that could serve as feature points, they cannot be matched because of their mutual similarity and instability. Meanwhile, many deep-learning works focus only on repeatability when designing the keypoint detector. On the other hand, metric-learning methods for robust local descriptors are trained at repeatable locations, which may be repeatable yet not reliably matchable, and this can compromise performance. The more recent R2D2 method addresses unstable texture regions by learning a reliability score for each dense descriptor.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, and aiming at the problems of current popular methods for jointly learning image feature points and descriptors, the invention provides a feature extraction network with a layer jump structure and a method for generating features and descriptors, i.e., an image feature point detection and descriptor generation method with a layer jump structure. Soft and hard feature point detection is then performed, and channel scores and uniqueness scores are used in feature point detection to select correct feature points and descriptors; the uniqueness score effectively eliminates mismatches. Finally, feature points and descriptors with high positioning precision and high accuracy are obtained.
Technical proposal
A feature extraction network having a layer jump structure, characterized by: the main structure is the conv1_1 layer to conv4_3 layer portion of VGG16, with the fully connected layers removed; the output feature maps of the conv3_3 layer and the conv4_3 layer are bilinearly interpolated and upsampled to the resolution of the conv2_2 output feature map, and tensor stitching is performed on the conv2_2 output and the upsampled feature maps, so that the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are fused; tensor stitching yields a feature map with 896 channels, which is then passed through a 1×1 convolution to become a feature map F with 512 channels.
The method for detecting image feature points and generating descriptors by adopting the feature extraction network with the layer jump structure is characterized by comprising the following steps:
step 1: selecting a visible light open source data set for labeling, processing each image in the data set with a random homography change and color dithering, and forming an image pair from the processed image and the original image, wherein the pixels of the image pair are connected together through a homography matrix; taking the annotated data set as the training set, and additionally selecting a labeled data set as the verification set;
step 2: using the feature extraction network F with the layer jump structure to perform feature extraction on the images in the training set, obtaining a 512-dimensional feature map F = F(I), where h×w is the spatial resolution of the feature map and n = 512 is the number of channels;
step 3: extracting descriptors from the 512-dimensional feature map, regarding each channel vector as a dense description of its position, d_ij = F_ij, and performing L2 regularization on it, d_ij ← d_ij / ||d_ij||_2, to obtain the dense descriptors of the image,
where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image, and F is the 512-dimensional feature map;
step 4: computing the channel score and the uniqueness score of the feature points with a soft feature point detector, and finally multiplying the detected channel score c_ij and uniqueness score u_ij to obtain the soft feature detector score at pixel (i, j), s_ij = c_ij·u_ij,
wherein:
the uniqueness score of the dense descriptor d_ij is its minimum distance to all other dense descriptors of the image, u_ij = min_{(k,l)≠(i,j)} ||d_ij − d_kl||_2,
where i = 1, …, h, j = 1, …, w, d_ij is a dense descriptor of the image, u_ij is the uniqueness score of the dense descriptor, and U is the set of all u_ij;
the channel score (channel contrast score) c_ij of descriptor d_ij measures how strongly its maximal channel response stands out from the other channel responses,
where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d_ij^t is the value of descriptor d_ij in channel t;
step 5: performing loss calculation on the soft feature point detection scores with a loss function, and then performing loss back-propagation training on the feature extraction network with the layer jump structure of step 2,
wherein I_1, I_2 are the RGB image pair input to the network, C is the set of pixel correspondences between image pair I_1 and I_2, s_c^(1) and s_c^(2) are the feature point total scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss;
step 6: extracting the feature map of the verification set with the feature extraction network with the layer jump structure trained in step 5, and using a hard feature detector to select, in the extracted 512-dimensional feature map, the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels as feature points, thereby obtaining the feature points and descriptors of the test image.
Advantageous effects
The invention provides a feature extraction network with a layer jump structure and a method for generating features and descriptors. The network is an image feature extraction network with a layer jump structure that fuses the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of VGG16 to retain detail information of all levels and effectively improve feature point positioning precision. The uniqueness index of a feature point refers to the degree of similarity between a local area of an image and the other areas of the same image. The uniqueness of each location in the image, i.e., its degree of similarity to all other locations, is measured by a uniqueness score, and the matching performance of the network is improved by selecting sufficiently unique feature points in the image. The invention achieves leading performance on the HPatches image matching dataset, particularly on its illumination sequences.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
By adding the feature fusion structure to the image feature extraction network, the feature map obtained by the network contains semantic information of different levels. The lower-level semantic information retains more low-level image information, such as edges or corners, which makes high-precision detection of image features possible, while the higher-level semantic information helps to enhance the accuracy of the final local feature matching and to reduce mismatches. Meanwhile, by designing uniqueness detection of the feature points in the feature point detection stage, the invention effectively suppresses the mismatches that texture regions easily produce. Tests show that, compared with D2-Net in image matching, the method improves feature point positioning accuracy by a factor of 2 at a projection error threshold of 1; at larger projection errors the method also performs excellently, with a mean matching accuracy of 0.913, an improvement of 0.011 over the currently best ASLFeat.
Drawings
Fig. 1 is an overall structure diagram of the present invention, which includes three parts of image feature extraction, feature fusion, and feature point detection.
Fig. 2 is a diagram of a feature extraction network of the present invention.
FIG. 3 is a network training flow diagram.
FIG. 4 is a graph comparing the effect of feature point extraction on HPatches.
FIG. 5 is a graph comparing HPatches matching effects of the present invention.
Detailed Description
The invention will now be further described with reference to the examples and figures.
The image feature point detection and descriptor generation method with the layer jump structure comprises the following steps:
Step 1: a visible light open source data set is selected for labeling, and each image in the data set is processed with a random homography change and color dithering; the processed image and the original image form an image pair whose pixels are connected together through a homography matrix. The annotated data set is taken as the training set, and a separate labeled data set is selected as the verification set.
Step 2: the images of the training set from step 1 are subjected to feature extraction by the following feature extraction network F with a layer jump structure, obtaining a 512-dimensional feature map F = F(I), where h×w is the spatial resolution of the feature map and n is the number of channels.
The feature extraction network with the layer jump structure is designed as follows: the main structure is the conv1_1 through conv4_3 portion of VGG16, with the fully connected layers removed. The output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are then fused, which preserves the spatial positioning precision of the feature points while combining features of different levels. First, the output feature maps of the conv3_3 and conv4_3 layers are bilinearly interpolated and upsampled to the resolution of the conv2_2 output feature map, and then the conv2_2 output and the upsampled feature maps are tensor-stitched. After tensor stitching, a feature map with 896 channels is obtained; this feature map then undergoes a 1×1 convolution to become the 512-channel feature map F.
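As a concrete illustration of the fusion just described, the following is a minimal PyTorch sketch built on torchvision's VGG16 (not the patent's actual implementation). The slice indices used to reach the conv2_2, conv3_3 and conv4_3 outputs follow torchvision's layer ordering, and the class name and the inclusion of each block's trailing ReLU are illustrative assumptions.

```python
# Sketch: skip-layer (layer jump) feature extraction on a VGG16 backbone.
# conv2_2 (128 ch, 1/2 res), conv3_3 (256 ch, 1/4 res) and conv4_3 (512 ch,
# 1/8 res) outputs are fused: the two coarser maps are bilinearly upsampled to
# conv2_2 resolution, concatenated (896 channels) and reduced to 512 channels
# with a 1x1 convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class SkipLayerExtractor(nn.Module):
    def __init__(self, pretrained=True):
        super().__init__()
        feats = vgg16(pretrained=pretrained).features
        # Slice indices follow torchvision's VGG16 layer ordering.
        self.block2 = feats[:9]     # up to conv2_2 + ReLU -> 128 channels
        self.block3 = feats[9:16]   # up to conv3_3 + ReLU -> 256 channels
        self.block4 = feats[16:23]  # up to conv4_3 + ReLU -> 512 channels
        self.fuse = nn.Conv2d(128 + 256 + 512, 512, kernel_size=1)

    def forward(self, x):
        c2 = self.block2(x)
        c3 = self.block3(c2)
        c4 = self.block4(c3)
        size = c2.shape[-2:]                      # conv2_2 resolution (1/2 of input)
        c3 = F.interpolate(c3, size=size, mode="bilinear", align_corners=False)
        c4 = F.interpolate(c4, size=size, mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([c2, c3, c4], dim=1))  # B x 512 x h x w feature map F
```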
Step 3: descriptor extraction is performed on the 512-dimensional feature map extracted in step 2. Each channel vector is regarded as a dense description of its position, d_ij = F_ij, and L2 regularization is applied to it, d_ij ← d_ij / ||d_ij||_2, to obtain the dense descriptors of the image,
where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image, and F is the 512-dimensional feature map.
Step 4: the channel score and the uniqueness score of the feature points are computed with a soft feature point detector, and the detected channel score and uniqueness score are finally multiplied to obtain the score of the soft feature point detector at pixel (i, j).
The uniqueness score u_ij of the dense descriptor d_ij is its minimum distance to all other dense descriptors of the image:
u_ij = min_{(k,l)≠(i,j)} ||d_ij − d_kl||_2,
where i = 1, …, h, j = 1, …, w, d_ij is a dense descriptor of the image, u_ij is the uniqueness score of the dense descriptor, and U is the set of all u_ij.
The channel score (channel contrast score) c_ij of descriptor d_ij measures how strongly its maximal channel response stands out from the other channel responses,
where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d_ij^t is the value of descriptor d_ij in channel t.
The soft feature detector score at pixel (i, j) is:
s_ij = c_ij·u_ij,
where c_ij is the channel score of the dense descriptor d_ij, u_ij is its uniqueness score, and s_ij is its total score. Only when both c_ij and u_ij are large enough can s_ij be large; s_ij reflects the degree to which the spatial position (i, j) can serve as a feature point.
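The sketch below shows one way the two scores could be computed for a dense descriptor map. The uniqueness score follows the minimum-distance definition used above; the channel score is an assumed peakiness measure (the maximum of a channel-wise softmax), since the patent's exact channel-score formula is not reproduced in this text. The direct pairwise-distance computation is only practical for small feature maps and is written this way for clarity.

```python
# Sketch: soft feature point detection scores for a fused feature map of shape
# (n, h, w). Uniqueness u_ij = distance to the nearest *other* descriptor in
# the same image; the channel score c_ij below is an illustrative assumption.
import torch
import torch.nn.functional as F

def soft_scores(feature_map):
    n, h, w = feature_map.shape
    d = F.normalize(feature_map, p=2, dim=0)          # L2-regularized descriptors d_ij
    flat = d.reshape(n, h * w).t()                    # one row per spatial position

    dist = torch.cdist(flat, flat)                    # (h*w, h*w) pairwise L2 distances
    dist.fill_diagonal_(float("inf"))                 # exclude the self-distance
    u = dist.min(dim=1).values.reshape(h, w)          # uniqueness score u_ij

    # Assumed channel score: maximum of a channel-wise softmax ("peakiness"),
    # NOT the patent's exact formula.
    c = torch.softmax(d, dim=0).max(dim=0).values     # channel score c_ij

    s = c * u                                         # soft detector score s_ij = c_ij * u_ij
    return s, c, u
```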
Step 5: a loss is computed from the soft feature point detection scores of step 4 with the following loss function, and loss back-propagation training is then performed on the feature extraction network with the layer jump structure of step 2.
The loss function is designed as follows: to train the network, the channel scores and uniqueness scores obtained by the feature point detector are incorporated into the loss function. For an image pair (I_1, I_2) there is a set of labeled pixel correspondences c: A ↔ B, A ∈ I_1, B ∈ I_2, where A and B are pixels of pictures I_1 and I_2 respectively. The loss used is a score-weighted sum of triplet ranking terms over all correspondences,
where I_1, I_2 are the RGB image pair input to the network, C is the set of pixel correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the feature point total scores of correspondence c in I_1 and I_2 respectively, and m(c) is a triplet ranking loss that minimizes the distance between the corresponding descriptors in the two images while maximizing the distance to the most confounding descriptor in either image.
Using the feature point scores as weights in the loss function ensures the sparsity of the loss and effectively prevents the network from overfitting. The loss can be reduced either by decreasing m(c), i.e., pulling the description vectors of matching feature points closer together and increasing their distinguishability, or by decreasing the detection scores, i.e., shrinking the regions of the picture with high feature point scores.
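Since the displayed loss formula is not reproduced in this text, the sketch below assumes a D2-Net-style form consistent with the description above: a triplet ranking term m(c) per correspondence, weighted by the product of the soft detector scores at the two corresponding positions and normalized by the sum of the weights. The margin value, the hard-negative selection and the function names are illustrative assumptions.

```python
# Sketch (assumed form): score-weighted triplet ranking loss over a set of
# labelled correspondences. desc1[i] <-> desc2[i] is a true correspondence.
import torch

def weighted_ranking_loss(desc1, desc2, scores1, scores2, margin=1.0):
    """desc1, desc2: (N, n) L2-normalized descriptors at corresponding positions;
    scores1, scores2: (N,) soft detector scores s_ij at those positions."""
    pos = (desc1 - desc2).norm(dim=1)              # distances of the true pairs
    dist = torch.cdist(desc1, desc2)               # all cross distances
    dist.fill_diagonal_(float("inf"))              # mask out the true pairs
    neg = torch.minimum(dist.min(dim=1).values,    # hardest confounding descriptor
                        dist.min(dim=0).values)    # on either side of each pair
    m = torch.clamp(margin + pos - neg, min=0.0)   # triplet ranking term m(c)

    w = scores1 * scores2                          # feature point scores as weights
    return (w * m).sum() / (w.sum() + 1e-8)
```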
Step 6: the feature extraction network with the layer jump structure trained in step 5 is used to extract the feature map of the verification set from step 1, and the following hard feature detector selects, in the extracted 512-dimensional feature map, the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels as feature points, thereby obtaining the feature points and descriptors of the test image.
The hard feature detector is designed as follows:
the method comprises the steps of analyzing factors influencing characteristic point matching performance, providing a characteristic point uniqueness index from the perspective of improving the characteristic point matching performance, and designing a characteristic point detector based on characteristic uniqueness by combining analysis of characteristic point description vectors. The uniqueness of each location in the image, i.e., the degree of similarity of each location in the image to all other locations, is measured by a uniqueness score. We improve the matching performance of the network by selecting sufficiently unique feature points in the image.
The uniqueness of a feature is defined as the degree of similarity between a local area of an image and the other areas of the same picture: the less similar a local region is to the other local regions, the higher its uniqueness. For a position (i, j) of the dense description, the uniqueness u_ij of the description vector d_ij is its minimum distance to all other dense descriptor vectors:
u_ij = min_{(k,l)≠(i,j)} ||d_ij − d_kl||_2,
where U is the set of all u_ij. The larger this minimum distance, the more unique the description vector is among all descriptor vectors. U is sorted in descending order, and the rank of u_ij in this ordering is recorded as p; a feature point requires p < α|U|.
In the invention, the feature point position (i, j, k) in the dense descriptor d is determined by the uniqueness u_ij of the descriptor vector d_ij and by the position k of its channel extremum. The uniqueness u_ij determines the spatial position of the feature point, while the channel extremum k indicates which feature response is largest in the descriptor vector; the corresponding k-th feature map layer is used to refine the spatial position (i, j) of the feature point.
For a description vector d_ij, the position of its channel extremum is k = argmax_t d_ij^t, and the rank of its uniqueness u_ij is p. In the experiments, α = 0.25 gives the best results, i.e., the selected feature points are more unique than 75% of the other points. In addition, although the coordinates of a feature point are (i, j, k), the channel extremum k is unique, so at most one point of the description vector at each spatial position (i, j) can serve as a feature point.
The hard feature point detection conditions of the invention are: a position (i, j) is selected as a feature point, with channel coordinate k, if its channel response is maximal at channel k and its uniqueness rank satisfies the uniqueness threshold,
where (i, j) is the spatial position in the feature map, i = 1, …, h, j = 1, …, w, d_ij^t is the value of descriptor d_ij in channel t, k is the value of t at which d_ij^t is maximal, u_ij is the uniqueness of the dense descriptor at (i, j), U is the set of uniqueness values of all dense descriptors, and the uniqueness quantile threshold is 0.75, i.e., a selected point must be more unique than 75% of the other positions.
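A sketch of the hard selection rule described above is given below: positions whose uniqueness exceeds that of 75% of all positions are kept, and the channel arg-max supplies the third coordinate k. The quantile-based implementation and the function name are assumptions made for illustration.

```python
# Sketch: hard feature point detection. A position (i, j) is kept if its
# uniqueness u_ij is larger than that of 75% of all positions; the channel
# index k is the arg-max over the descriptor's channels.
import torch
import torch.nn.functional as F

def hard_detect(feature_map, unique_frac=0.75):
    """feature_map: (n, h, w) fused feature map F. Returns (i, j, k) triples."""
    n, h, w = feature_map.shape
    d = F.normalize(feature_map, p=2, dim=0)
    flat = d.reshape(n, h * w).t()

    dist = torch.cdist(flat, flat)
    dist.fill_diagonal_(float("inf"))
    u = dist.min(dim=1).values                  # uniqueness of every position

    thresh = torch.quantile(u, unique_frac)     # more unique than 75% of positions
    keep = u >= thresh

    k = d.reshape(n, h * w).argmax(dim=0)       # channel extremum position k
    idx = keep.nonzero(as_tuple=False).squeeze(1)
    i, j = idx // w, idx % w
    return torch.stack([i, j, k[idx]], dim=1)   # at most one candidate per (i, j)
```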
Specific examples:
Referring to fig. 1, the present invention performs detection of image feature points and local feature description according to the following steps:
Step 1: the train split of COCO2014, containing 82783 images, is selected for labeling. Each image in the data set is processed with a random homography change and color dithering; the processed image and the original image form an image pair whose pixels are connected together through a homography matrix. The annotated COCO2014 train split is used as the training set, and the standard HPatches dataset is used as the test set.
Step 2: referring to fig. 2, feature extraction is performed on the training set generated in step 1 using the feature extraction network with the layer jump structure. The main structure of the feature extraction network is the conv1_1 to conv4_3 part of VGG16, with the fully connected layers removed. Meanwhile, to maintain the spatial positioning precision of the feature points and fuse features of different levels, the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the VGG-16 network are fused. First, the conv3_3 and conv4_3 outputs are bilinearly interpolated and upsampled to the resolution of conv2_2, and tensor stitching is then performed. After tensor stitching, a feature map with 896 channels, carrying three different levels of semantic information, is obtained. This feature map then undergoes a 1×1 convolution that fuses the semantic features of the different levels, yielding the 512-channel feature map F.
Step 3: the feature map F from step 2 contains detail information of different levels and has 1/2 the resolution of the original image; the invention performs descriptor extraction directly on F.
The dense description vector of the image is d_ij = F_ij, where i = 1, …, h, j = 1, …, w. When comparing images, correspondences can be established very conveniently from the Euclidean distance between descriptor vectors. As before, the descriptor vector d_ij is L2-regularized before comparison: d_ij ← d_ij / ||d_ij||_2.
Step 4: referring to the feature point detection part of fig. 1, soft feature detection is applied to the feature map: for each dense descriptor d_ij, the uniqueness score u_ij and the channel contrast score c_ij are computed, and the total score of each dense descriptor is s_ij = c_ij·u_ij.
Step 5: a loss of the following form is computed from the soft feature point detection scores of step 4, where m(c) is a triplet ranking loss that minimizes the distance between the corresponding descriptors while maximizing the distance to the most confounding descriptor.
Here I_1, I_2 are the RGB image pair input to the network, C is the set of pixel correspondences between I_1 and I_2, and s_c^(1), s_c^(2) are the feature point total scores of correspondence c in I_1 and I_2 respectively.
Referring to fig. 3, the network is trained in the order of steps 2, 3, 4 and 5. When training the network, to reduce the computational load, the extracted feature maps F_1, F_2 are average-pooled so that the resolution becomes 1/2 of the input before soft feature point detection is performed. To obtain a better training effect and save training time, fine-tuning is performed with Adam starting from weights pre-trained on the ImageNet image classification task; during fine-tuning, the conv2_2, conv3_3 and conv4_3 layers are unlocked. During training, 8 pairs of images center-cropped to 224×224, together with their labels, are input as a batch in each iteration; the initial learning rate of Adam is set to 10^-5, and training runs for 40 epochs in total.
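A sketch of this fine-tuning setup is given below. Only the optimizer, learning rate, crop and batch sizes, epoch count and the average pooling of the feature maps come from the text; the dataset object, the freezing of the remaining backbone layers and the compute_pair_loss helper (built on the weighted ranking loss sketched earlier) are assumptions.

```python
# Sketch of the fine-tuning loop described above.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(model, dataset, epochs=40):
    loader = DataLoader(dataset, batch_size=8, shuffle=True)  # 8 image pairs per batch
    optim = torch.optim.Adam(model.parameters(), lr=1e-5)     # initial LR 10^-5

    for epoch in range(epochs):
        for img1, img2, corr in loader:                        # 224x224 center crops
            f1, f2 = model(img1), model(img2)
            # Average-pool the feature maps to half resolution before soft
            # detection, to reduce the computational load as described above.
            f1 = F.avg_pool2d(f1, kernel_size=2)
            f2 = F.avg_pool2d(f2, kernel_size=2)
            loss = compute_pair_loss(f1, f2, corr)             # assumed helper
            optim.zero_grad()
            loss.backward()
            optim.step()
```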
Decreasing m(c) means pulling the description vectors of matching feature points closer together and increasing their distinguishability; alternatively, the loss can be reduced by decreasing the detection scores, i.e., shrinking the regions of the picture with high feature point scores.
Step 6: the feature extraction network with the layer jump structure trained in step 5 is used to extract the feature maps of the verification set HPatches provided in step 1. The pixels of the feature map whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels are selected as feature points, yielding the feature points and descriptors of the test image.
The test results are shown in fig. 4. At test time, the keypoints are post-processed with SIFT-like edge elimination (threshold set to 10) and sub-pixel refinement, and the descriptors are then bilinearly interpolated at the refined locations.
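The SIFT-like edge elimination mentioned above can be sketched as follows: a keypoint is rejected when the ratio of principal curvatures of the score surface at that point exceeds r = 10, estimated from a finite-difference 2×2 Hessian. The function name and the assumption that keypoints lie at least one pixel from the border are illustrative; the sub-pixel refinement step is not shown.

```python
# Sketch: SIFT-like edge elimination on a detector score map.
# Reject a keypoint if tr(H)^2 / det(H) > (r + 1)^2 / r, with H the 2x2 Hessian
# of the score surface estimated by central finite differences (r = 10).
import torch

def edge_filter(score_map, keypoints, r=10.0):
    """score_map: (h, w); keypoints: (K, 2) integer (i, j), away from the border."""
    s = score_map
    i, j = keypoints[:, 0], keypoints[:, 1]
    dxx = s[i, j + 1] - 2 * s[i, j] + s[i, j - 1]
    dyy = s[i + 1, j] - 2 * s[i, j] + s[i - 1, j]
    dxy = (s[i + 1, j + 1] - s[i + 1, j - 1] - s[i - 1, j + 1] + s[i - 1, j - 1]) / 4
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    # det <= 0 marks saddle-like points; the curvature-ratio test handles the rest.
    edge_like = (det <= 0) | (tr * tr / det > (r + 1) ** 2 / r)
    return keypoints[~edge_like]
```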
The HPatches dataset was constructed by Balntas et al. for evaluating image feature descriptors. It comprises image sequences of 116 scenes: 59 scenes form the viewpoint group, whose images are sequences taken from different viewpoints of the same planar scene, and the other 57 scenes form the illumination group, i.e., image sequences of the same scene under a fixed viewpoint and different illumination conditions. Each scene of the HPatches dataset has 6 images, the first being the reference image. In the experiment, picture sequences with resolution greater than 1600×1200 are rejected, and the remaining 52 illumination sequences and 56 viewpoint sequences are used for testing. Feature points and descriptors of each sequence image are first extracted with the different methods, and the feature points of each method are then matched with a nearest-neighbour search, accepting only mutual nearest neighbours. The mean matching accuracy (MMA) is used as the verification index.
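A sketch of this evaluation protocol follows: mutual nearest-neighbour matching of descriptors, then the mean matching accuracy as the fraction of matches whose reprojection error under the ground-truth homography is below a pixel threshold. Function names and the example threshold are illustrative assumptions.

```python
# Sketch: mutual nearest-neighbour matching and mean matching accuracy (MMA).
import torch

def mutual_nn_matches(desc1, desc2):
    """desc1: (N1, n), desc2: (N2, n). Returns index pairs that are mutual NNs."""
    dist = torch.cdist(desc1, desc2)
    nn12 = dist.argmin(dim=1)                 # nearest neighbour in image 2
    nn21 = dist.argmin(dim=0)                 # nearest neighbour in image 1
    ids1 = torch.arange(desc1.shape[0])
    mutual = nn21[nn12] == ids1               # keep only mutual nearest neighbours
    return torch.stack([ids1[mutual], nn12[mutual]], dim=1)

def mma(kpts1, kpts2, matches, H, threshold=3.0):
    """Fraction of matches whose reprojection error under H is below threshold (px)."""
    pts = kpts1[matches[:, 0]].float()                         # (M, 2) keypoint coords
    ones = torch.ones(pts.shape[0], 1)
    proj = (H @ torch.cat([pts, ones], dim=1).t()).t()         # homogeneous projection
    proj = proj[:, :2] / proj[:, 2:3]
    err = (proj - kpts2[matches[:, 1]].float()).norm(dim=1)
    return (err < threshold).float().mean().item()
```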
For each image pair, the invention matches the features extracted by each method with a nearest-neighbour search, accepting only mutual nearest neighbours. A match is considered correct if its reprojection error under the homography estimate provided with the dataset is below a given matching threshold. To demonstrate the superiority of the invention, it is compared with different methods that jointly learn feature points and descriptors: SuperPoint, LF-Net, D2-Net, R2D2 and the latest ASLFeat. The MMA values of the different methods at different thresholds are recorded, and the comparative results are given in tables 1 and 2.
See fig. 5. To better demonstrate the method, its effect is illustrated qualitatively on the HPatches dataset. For feature points, three groups are chosen; it can clearly be seen that the method effectively removes repetitive texture regions in the scene, such as leaves, grassland and paved floors, which are very prone to mismatching because of their self-similarity and instability, despite containing a large number of candidate feature points. For feature point matching, one pair from the illumination group and one pair from the viewpoint group are selected for comparison. In the illumination group, D2-Net and ASLFeat clearly find more matches, but these fall on unstable texture regions such as sky and leaves, so those matches are invalid. In the viewpoint group, ASLFeat and D2-Net clearly devote more effort to such unstable matches, whereas the present method gives more representative matches.
Table 1 Comparison of the validation results of the present invention on the HPatches validation set.
Table 2 Comparison of the feature point positioning accuracy of the present invention.

Claims (1)

1. A method for detecting image feature points and generating descriptors by adopting a feature extraction network with a layer jump structure, characterized in that: the main structure of the feature extraction network with the layer jump structure is the conv1_1 layer to conv4_3 layer part of VGG16, with the fully connected layer removed; the output feature maps of the conv3_3 layer and the conv4_3 layer are bilinearly interpolated and upsampled to the resolution of the conv2_2 output feature map, and tensor stitching is performed on the conv2_2 output and the upsampled feature maps, so that the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are fused; tensor stitching yields a feature map with 896 channels, which is then passed through a 1×1 convolution to become a feature map F with 512 channels;
the method comprises the following steps:
step 1: selecting a visible light open source data set for labeling, processing each image in the data set with a random homography change and color dithering, and forming an image pair from the processed image and the original image, wherein the pixels of the image pair are connected together through a homography matrix; taking the annotated data set as the training set, and additionally selecting a labeled data set as the verification set;
step 2: using the feature extraction network F with the layer jump structure to perform feature extraction on the images in the training set, obtaining a 512-dimensional feature map F = F(I), where h×w is the spatial resolution of the feature map and n = 512 is the number of channels;
step 3: extracting descriptors from the 512-dimensional feature map, regarding each channel vector as a dense description of its position,
d_ij = F_ij,
and performing L2 regularization on it, d_ij ← d_ij / ||d_ij||_2, to obtain the dense descriptors of the image,
where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image, and F is the 512-dimensional feature map;
step 4: computing the channel score and the uniqueness score of the feature points with a soft feature point detector, and finally multiplying the detected channel score c_ij and uniqueness score u_ij to obtain the soft feature detector score at pixel (i, j), s_ij = c_ij·u_ij,
wherein:
the uniqueness score of the dense descriptor d_ij is its minimum distance to all other dense descriptors of the image, u_ij = min_{(k,l)≠(i,j)} ||d_ij − d_kl||_2,
where i = 1, …, h, j = 1, …, w, d_ij is a dense descriptor of the image, u_ij is the uniqueness score of the dense descriptor, and U is the set of all u_ij;
the channel score (channel contrast score) c_ij of descriptor d_ij measures how strongly its maximal channel response stands out from the other channel responses,
where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d_ij^t is the value of descriptor d_ij in channel t;
step 5: performing loss calculation on the soft feature point detection scores with a loss function, and then performing loss back-propagation training on the feature extraction network with the layer jump structure of step 2,
wherein I_1, I_2 are the RGB image pair input to the network, C is the set of pixel correspondences between image pair I_1 and I_2, s_c^(1) and s_c^(2) are the feature point total scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss;
step 6: extracting the feature map of the verification set with the feature extraction network with the layer jump structure trained in step 5, and using a hard feature detector to select, in the extracted 512-dimensional feature map, the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels as feature points, thereby obtaining the feature points and descriptors of the test image.
CN202110281763.8A 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors Active CN113052311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281763.8A CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110281763.8A CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Publications (2)

Publication Number Publication Date
CN113052311A CN113052311A (en) 2021-06-29
CN113052311B true CN113052311B (en) 2024-01-19

Family

ID=76512664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281763.8A Active CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Country Status (1)

Country Link
CN (1) CN113052311B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332509B (en) * 2021-12-29 2023-03-24 阿波罗智能技术(北京)有限公司 Image processing method, model training method, electronic device and automatic driving vehicle

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110781924A (en) * 2019-09-29 2020-02-11 哈尔滨工程大学 Side-scan sonar image feature extraction method based on full convolution neural network
CN110827238A (en) * 2019-09-29 2020-02-21 哈尔滨工程大学 Improved side-scan sonar image feature extraction method of full convolution neural network
CN110929748A (en) * 2019-10-12 2020-03-27 杭州电子科技大学 Motion blur image feature matching method based on deep learning
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110781924A (en) * 2019-09-29 2020-02-11 哈尔滨工程大学 Side-scan sonar image feature extraction method based on full convolution neural network
CN110827238A (en) * 2019-09-29 2020-02-21 哈尔滨工程大学 Improved side-scan sonar image feature extraction method of full convolution neural network
CN110929748A (en) * 2019-10-12 2020-03-27 杭州电子科技大学 Motion blur image feature matching method based on deep learning
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Remote sensing image matching based on ResNet and RF-Net; Liao Mingzhe; Wu Jin; Zhu Lei; Chinese Journal of Liquid Crystals and Displays (09); full text *

Also Published As

Publication number Publication date
CN113052311A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
JP7482181B2 (en) Image processing device and image processing method
WO2022000426A1 (en) Method and system for segmenting moving target on basis of twin deep neural network
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
US9633282B2 (en) Cross-trained convolutional neural networks using multimodal images
Babenko et al. Robust object tracking with online multiple instance learning
Lisanti et al. Group re-identification via unsupervised transfer of sparse features encoding
US20120301014A1 (en) Learning to rank local interest points
CN111126412B (en) Image key point detection method based on characteristic pyramid network
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
CN113159043A (en) Feature point matching method and system based on semantic information
CN111709317B (en) Pedestrian re-identification method based on multi-scale features under saliency model
Zhang et al. Removing Foreground Occlusions in Light Field using Micro-lens Dynamic Filter.
CN113052311B (en) Feature extraction network with layer jump structure and method for generating features and descriptors
Pultar Improving the hardnet descriptor
CN110910497B (en) Method and system for realizing augmented reality map
Jin Kim et al. Learned contextual feature reweighting for image geo-localization
CN110070626B (en) Three-dimensional object retrieval method based on multi-view classification
Yan et al. Depth-only object tracking
CN116469172A (en) Bone behavior recognition video frame extraction method and system under multiple time scales
CN111582057B (en) Face verification method based on local receptive field
CN114972937A (en) Feature point detection and descriptor generation method based on deep learning
CN111126198B (en) Pedestrian re-identification method based on deep representation learning and dynamic matching
CN114445649A (en) Method for detecting RGB-D single image shadow by multi-scale super-pixel fusion
CN108876849B (en) Deep learning target identification and positioning method based on auxiliary identification
CN113160291A (en) Change detection method based on image registration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant