CN115240079A - Multi-source remote sensing image depth feature fusion matching method - Google Patents

Multi-source remote sensing image depth feature fusion matching method

Info

Publication number
CN115240079A
CN115240079A (application CN202210792899.XA)
Authority
CN
China
Prior art keywords
feature
image
matching
matched
level
Prior art date
Legal status
Pending
Application number
CN202210792899.XA
Other languages
Chinese (zh)
Inventor
蓝朝桢
王龙号
施群山
周杨
张衡
李鹏程
吕亮
胡校飞
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Priority to CN202210792899.XA
Publication of CN115240079A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of image feature matching, and particularly relates to a multi-source remote sensing image depth feature fusion matching method. Through feature extraction, the method obtains fine fusion features that carry both high-level and low-level information, and performs feature transformation on the high-level features, which raises the similarity between matching points in the high-level features of the images to be matched and makes the matching result more reliable. In the later matching stage, the high-level features are compared for coarse matching to obtain a global matching result, which is then corrected with the fine fusion features, so that the matching result is reliable both globally and in precision, with higher resolution and more accurate positioning. Before feature transformation, the high-level features are given sinusoidal position codes so that a unique correspondence exists among the features, avoiding mismatches caused by overly high similarity among features in sparse texture regions; during coarse matching, a sliding window adaptively lowers the score threshold, which increases the number of matching points screened out in sparse regions and further improves the matching effect there.

Description

Multi-source remote sensing image depth feature fusion matching method
Technical Field
The invention belongs to the field of image feature matching, and particularly relates to a multi-source remote sensing image depth feature fusion matching method.
Background
Multi-source remote sensing images of the same area usually record different characteristics of the same ground features, so combining them can display the ground feature information more completely. The premise of jointly processing multi-source remote sensing images to mine this information is to match the multi-source images, that is, to identify the homonymous (corresponding) points of two or more multi-source images.
Current multi-source remote sensing image matching methods usually apply convolutional neural networks to match multi-source images, for example the CMM-Net and LoFTR algorithms. The CMM-Net algorithm extracts a high-dimensional feature map of the multi-source images with a convolutional neural network, selects feature points on the principle that the channel maximum and the local channel maximum are satisfied simultaneously, and finally completes multi-source image matching. The LoFTR algorithm strengthens feature correlation through a Transformer and obtains a dense matching result through coarse and fine matching; its matching effect in sparse texture regions is good, but its feature extraction network is limited to a single image scale and therefore cannot meet matching requirements in multiple respects.
Existing multi-source image feature matching methods therefore still suffer from defects such as features that are difficult to characterize, large differences in feature vector similarity, and matching difficulties caused by the small number of features in sparse texture regions of multi-source remote sensing images, all of which affect the accuracy of the final matching result. A remote sensing feature matching scheme that can solve these problems is therefore needed.
Disclosure of Invention
The invention aims to provide a multi-source remote sensing image depth feature fusion matching method, to solve the problems of existing multi-source image feature matching methods, namely that the accuracy of the final matching result is affected by features that are difficult to characterize, large differences in feature vector similarity, and matching difficulties caused by the small number of features in sparse texture regions of multi-source remote sensing images.
In order to achieve the purpose, the invention provides a depth feature fusion matching method for a multi-source remote sensing image, which comprises the following steps:
1) Constructing a matching model, wherein the matching model comprises a feature extraction network, a feature transformation module, a dense matching module and a calibration optimization module;
2) Inputting the acquired remote sensing image pair to be matched into the feature extraction network, which extracts features from each image to be matched separately to obtain, for each image, its high-level features and fine fusion features that combine high-level and low-level information (fine positioning information and global information);
3) Inputting the high-level features of the image pair to be matched into the feature transformation module simultaneously; for each image to be matched, fusing its own neighborhood information with the high-level features of the other image to perform feature transformation and fusion, obtaining correlated high-level fusion features for each image of the pair;
4) Densely matching all feature vectors of the high-level fusion features of the image pair to be matched, and obtaining a coarse matching result according to the similarity between the feature vectors;
5) Mapping the coarse matching result onto the fine fusion features to calibrate and optimize the dense matching, obtaining a fine matching result.
The method has the following beneficial effects. During feature extraction, the fine fusion features obtained through extraction and fusion carry high-level and low-level information at the same time, which ensures positioning accuracy, global context, and anti-interference capability; the extracted high-level features then undergo feature transformation and fusion to obtain high-level fusion features, which raises the similarity between matching points in the high-level features of the images to be matched and makes the matching result more reliable. In the later matching stage, the high-level fusion features are compared for coarse matching to obtain a result that better conforms to the global characteristics, and the fine fusion features are then compared to correct that result, so that the final matching result is reliable both globally and in precision, with higher resolution and more accurate positioning.
Further, the feature extraction network comprises three down-sampling layers and two up-sampling layers;
the first down-sampling layer is used for down-sampling the input image to be matched to obtain a high-level feature map with the original dimension of 1/2 of the image to be matched; the second down-sampling layer is used for down-sampling the input high-level feature map of the original dimensionality 1/2 of the image to be matched to obtain a high-level feature map of the original dimensionality 1/4 of the image to be matched; the third down-sampling layer is used for down-sampling the input high-level feature map of the original dimension 1/4 of the image to be matched to obtain the high-level feature map of the original dimension 1/8 of the image to be matched;
the first up-sampling layer is used for up-sampling an input high-level feature map with original dimension 1/8 of an image to be matched into a low-level feature map with original dimension 1/4 of the image to be matched, meanwhile, performing convolution processing on the high-level feature map with original dimension 1/4, and then fusing the low-level feature map with original dimension 1/4 and the high-level feature map with original dimension 1/4 to obtain a fused feature map with original dimension 1/4; the second upsampling layer is used for upsampling the input fusion feature map with the original dimension of 1/4 into a low-level feature map with the original dimension of 1/2, simultaneously performing convolution processing on the high-level feature map with the original dimension of 1/2, and then fusing the low-level feature map with the original dimension of 1/2 and the high-level feature map with the original dimension of 1/2 to obtain a fine fusion feature map with the original dimension of 1/2;
and outputting a high-level feature map of the original dimension 1/8 and a fine fusion feature map of the original dimension 1/2 of each image to be matched for fusion feature matching through the three down-sampling layers and the two up-sampling layers. In order to improve the information globality, the positioning accuracy and the anti-interference capability of the multi-source remote sensing image characteristics, when the characteristics are extracted, the high-level characteristics and the low-level characteristics are fused twice, so that the finally obtained fine fusion characteristic diagram comprises the information of the characteristics of each layer, and the richer the contained characteristic information is, the higher the information globality and the positioning accuracy of the characteristics are.
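The following PyTorch-style sketch illustrates one way such a three-down/two-up extraction network could be organized; it is not the patented implementation, and the class names, the use of stride-2 convolutions, 1x1 lateral convolutions, bilinear up-sampling, and element-wise addition as the fusion operation are all assumptions (the channel widths 128/196/256 follow the embodiment described later).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlock(nn.Module):
    """Stride-2 convolutional down-sampling stage (a plain placeholder for the
    residual convolution block described elsewhere in the patent)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.conv(x)

class FeatureExtractor(nn.Module):
    """Three down-sampling stages (1/2, 1/4, 1/8) and two up-sampling/fusion stages;
    returns the 1/8-resolution high-level map and the 1/2-resolution fine fusion map."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(1, 1, 3, padding=1)   # H x W x 1 input feature image
        self.down1 = DownBlock(1, 128)              # 1/2 resolution, 128-d
        self.down2 = DownBlock(128, 196)            # 1/4 resolution, 196-d
        self.down3 = DownBlock(196, 256)            # 1/8 resolution, 256-d
        self.lat2 = nn.Conv2d(196, 196, 1)          # 1x1 lateral conv on the 1/4 map
        self.lat1 = nn.Conv2d(128, 128, 1)          # 1x1 lateral conv on the 1/2 map
        self.up3to2 = nn.Conv2d(256, 196, 1)        # channel reduction before fusing at 1/4
        self.up2to1 = nn.Conv2d(196, 128, 1)        # channel reduction before fusing at 1/2

    def forward(self, img):                         # img: B x 1 x H x W
        f1 = self.down1(self.stem(img))             # B x 128 x H/2 x W/2
        f2 = self.down2(f1)                         # B x 196 x H/4 x W/4
        f3 = self.down3(f2)                         # B x 256 x H/8 x W/8 (high-level output)
        u2 = F.interpolate(self.up3to2(f3), scale_factor=2, mode="bilinear", align_corners=False)
        fuse2 = self.lat2(f2) + u2                  # 1/4 fusion: lateral map + up-sampled high-level map
        u1 = F.interpolate(self.up2to1(fuse2), scale_factor=2, mode="bilinear", align_corners=False)
        fuse1 = self.lat1(f1) + u1                  # 1/2 fine fusion map
        return f3, fuse1
```

Calling the module on a B x 1 x H x W tensor would return the two maps described above; whether the fusion is element-wise addition or channel concatenation is not fixed by the text, so addition is used here only as one plausible choice.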
Further, the fusion feature matching of the feature transformation module specifically comprises the following steps:
1) Adding position information to the 1/8-dimension high-level feature map of each image to be matched, so that each feature corresponds uniquely to its position on the original image;
2) Flattening the position-encoded high-level feature map of each image to be matched into a one-dimensional vector, inputting it to the feature transformation module, performing several rounds of interleaved processing through the attention layers, and outputting correlated high-level fusion features for each image to be matched.
The attention layers comprise a self-attention layer, a cross-attention layer, and an attention layer. The self-attention layer fuses the position-encoded features of the reference image in the input image pair with its own local neighborhood information to generate a new feature map; the cross-attention layer fuses the position-encoded features of the reference image with the features of the other image in the pair to be matched; the attention layer selects relevant information by measuring the similarity between the query vector and each key feature, and the selected result, after normalization, is superimposed on the flattened one-dimensional vector of each image's position-encoded high-level feature map to obtain fusion features that combine position information, the image's own neighborhood information, and information from the image to be matched. Interleaved processing means feeding the obtained fusion features back into the attention layers for further interleaving; this process is repeated, and the high-level fusion features of the images to be matched are finally output.
Because the correlation between homonymous features of multi-source remote sensing images is often low, the similarity between matching points during matching is also low, making the matching points hard to identify accurately. Improving the correlation of the high-level features of the images to be matched through feature transformation therefore improves the accuracy of matching point identification and yields a more reliable matching result.
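As an illustration of the interleaving described above, the sketch below alternates self- and cross-attention over the two flattened high-level feature sequences. The module name, the use of PyTorch's nn.MultiheadAttention as a stand-in for the linear attention detailed later in the embodiment, the shared weights between the two images, and the residual-plus-LayerNorm arrangement are all assumptions.

```python
import torch
import torch.nn as nn

class FeatureTransformModule(nn.Module):
    """Interleaves self- and cross-attention over the flattened, position-encoded
    1/8-resolution features of the two images to be matched."""
    def __init__(self, dim=256, heads=8, n_interleave=8):
        super().__init__()
        self.n_interleave = n_interleave
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: B x (H/8 * W/8) x dim flattened feature sequences
        for _ in range(self.n_interleave):
            # self-attention: each image fuses its own local neighborhood information
            feat_a = feat_a + self.norm(self.self_attn(feat_a, feat_a, feat_a)[0])
            feat_b = feat_b + self.norm(self.self_attn(feat_b, feat_b, feat_b)[0])
            # cross-attention: each image fuses information from the other image
            a_new = feat_a + self.norm(self.cross_attn(feat_a, feat_b, feat_b)[0])
            b_new = feat_b + self.norm(self.cross_attn(feat_b, feat_a, feat_a)[0])
            feat_a, feat_b = a_new, b_new
        return feat_a, feat_b
```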
Furthermore, in order to reduce parameters and computation while preserving down-sampling precision, all three down-sampling layers adopt a convolution block structure in which the input feature map and the output feature map are superimposed element-wise, and the superimposed feature map is taken as the down-sampling result.
The convolution block structure can simply perform an identity mapping during forward propagation of the neural network: the single-convolution result and the three-convolution result are stacked together to form the down-sampling result. This identity connection adds no extra parameters or computational complexity, improves training efficiency, and preserves sampling precision. Moreover, the structure does not affect back-propagation during training.
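A minimal sketch of such a residual down-sampling block follows, assuming a stride-2 three-convolution branch and a single-convolution shortcut; the exact kernel sizes and normalization layers are not specified in the patent and are illustrative here.

```python
import torch
import torch.nn as nn

class ResidualDownBlock(nn.Module):
    """Down-sampling convolution block with an identity-style skip: the single-convolution
    branch is added element-wise to the three-convolution branch, as described for Fig. 2."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        # single-convolution shortcut so the input and output maps can be superimposed
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return torch.relu(self.branch(x) + self.shortcut(x))
```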
Furthermore, in order to improve the matching effect in sparse texture regions of multi-source images, the position information is added as a sinusoidal code for each pixel feature.
Because sinusoidal coding provides unique position information for every pixel, each feature corresponds uniquely to its pixel position on the original image and features at different levels have a uniquely determined correspondence. This avoids the mismatching problem caused by overly high similarity between feature vectors in sparse texture regions of the image and improves the matching effect in those regions.
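One possible form of the per-pixel sinusoidal position code is sketched below, assuming a DETR/LoFTR-style 2-D layout in which half of the channels encode the x coordinate and half encode the y coordinate; the exact frequency arrangement used by the patent is not given, so this is only an illustration.

```python
import torch

def add_sinusoidal_encoding(feat):
    """Add a 2-D sinusoidal position code to a B x C x H x W feature map so that every
    pixel carries a unique position signature."""
    b, c, h, w = feat.shape
    assert c % 4 == 0, "channel count must be divisible by 4 for this layout"
    pe = torch.zeros(c, h, w, device=feat.device)
    y = torch.arange(h, device=feat.device).unsqueeze(1).expand(h, w).float()
    x = torch.arange(w, device=feat.device).unsqueeze(0).expand(h, w).float()
    # geometric frequency ladder, one (sin, cos) pair per coordinate per frequency
    div = torch.exp(torch.arange(0, c // 2, 2, device=feat.device).float()
                    * (-torch.log(torch.tensor(10000.0)) / (c // 2)))
    for i, d in enumerate(div):
        pe[4 * i]     = torch.sin(x * d)
        pe[4 * i + 1] = torch.cos(x * d)
        pe[4 * i + 2] = torch.sin(y * d)
        pe[4 * i + 3] = torch.cos(y * d)
    return feat + pe.unsqueeze(0)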
Further, the coarse matching represents the similarity between all feature vectors through a score matrix between the high-level fusion feature vectors; if the similarity is greater than a score threshold, the pair is regarded as a correct match.
The score matrix S between the vectors is determined by the following formula, where <·,·> denotes the inner product:
S(i, j) = <F_A_tr(i), F_B_tr(j)>, (i, j) ∈ A×B
where F_A_tr and F_B_tr are the high-level fusion features of the images A and B to be matched, and A×B is the set of all possible pixel correspondences between images A and B.
The score matrix is computed for all possible matches, and the optimal assignment matrix P is obtained by maximizing the total score Σ_i,j S(i,j) P(i,j). The optimal assignment matrix P represents the optimal correspondence between the high-level fusion feature vectors of image A and those of image B.
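The sketch below computes the score matrix S from the two sets of high-level fusion feature vectors and extracts matches with a score threshold plus a mutual-nearest-neighbor check. This greedy selection is only a stand-in for solving the optimal assignment matrix P (the embodiment later mentions the Sinkhorn algorithm for that purpose); the function name, the L2-normalization assumption, and the threshold value are illustrative.

```python
import torch

def coarse_match(feat_a, feat_b, score_thr=0.2):
    """Compute S(i, j) = <F_A_tr(i), F_B_tr(j)> and keep mutually-nearest pairs
    whose score exceeds the threshold."""
    # feat_a: Na x C, feat_b: Nb x C, assumed L2-normalized so the inner product is a similarity
    scores = feat_a @ feat_b.t()                          # Na x Nb score matrix S
    best_b = scores.argmax(dim=1)                         # best column for every row
    best_a = scores.argmax(dim=0)                         # best row for every column
    rows = torch.arange(feat_a.shape[0])
    mutual = best_a[best_b] == rows                       # mutual nearest neighbors
    keep = mutual & (scores[rows, best_b] > score_thr)    # score-threshold filter
    return torch.stack([rows[keep], best_b[keep]], dim=1) # (i, j) index pairs
```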
Further, in order to accurately increase the coarse matching results in sparse texture regions, a sliding-window adaptive score threshold detection algorithm is applied to the coarse matching on the score matrix, with the following steps:
1) Setting an initial score threshold θ, setting the sliding window area and the horizontal and vertical sliding steps, and sliding-detecting the scores of the high-level fusion feature vectors;
2) If the scores s of all high-level fusion feature vectors in the current window are less than θ, computing the adaptive threshold avgθ for this matching-sparse window; traversing the high-level fusion feature vectors in the window, adding every vector whose score s exceeds avgθ to the coarse matching point set, and continuing to slide the window;
3) If any feature vector in the current window has a score s greater than or equal to θ, continuing to slide the window;
4) Repeating these steps until the window has traversed the high-level fusion features of the images to be matched, and outputting the coarse matching point set.
the calculation method of the area of the sliding window, the horizontal sliding step length and the vertical sliding step length is as follows
Figure BDA0003731006350000051
Figure BDA0003731006350000052
Figure BDA0003731006350000053
ws is the area of the sliding window, hl is the horizontal sliding step length, and vl is the vertical sliding step length;
the calculation formula of the adaptive threshold value avg theta in the matching sparse node is as follows
Figure BDA0003731006350000054
Wherein n is the number of the characteristic vectors in the sliding window, s i The matching scores of the feature vectors in the sliding window are obtained.
Because the number of matching point pairs in sparse or uniform texture regions is small during multi-source remote sensing image matching, the number of high-score matching pairs there may be lower than in densely matched regions. Adaptively lowering the score threshold in matching-sparse regions screens in additional lower-score matching pairs and supplements the matching data, while redundant low-score pairs in densely matched regions are not selected, avoiding the introduction of errors.
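A simplified sketch of the sliding-window adaptive threshold idea follows, assuming the per-cell best matching scores are arranged as a 2-D map on image A's 1/8-resolution grid. The window size and steps are fixed placeholders here, whereas the patent derives them with formulas not reproduced above, and ws is used as a side length rather than an area purely for simplicity.

```python
import numpy as np

def adaptive_threshold_select(score_map, theta=0.2, ws=16, hl=8, vl=8):
    """Sliding-window adaptive score threshold: in windows where every score is below
    the global threshold, admit matches above the window's mean score instead."""
    h, w = score_map.shape
    selected = score_map >= theta                      # matches kept by the global threshold
    for top in range(0, max(h - ws, 0) + 1, vl):
        for left in range(0, max(w - ws, 0) + 1, hl):
            win = score_map[top:top + ws, left:left + ws]
            if (win < theta).all():                    # matching-sparse window
                avg_theta = win.mean()                 # adaptive threshold: mean score in window
                extra = win > avg_theta                # locally strong but globally weak matches
                selected[top:top + ws, left:left + ws] |= extra
    return selected

# usage sketch with random placeholder scores
scores = np.random.rand(60, 80) * 0.3
mask = adaptive_threshold_select(scores, theta=0.25)
```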
Further, in order to improve the matching precision, the process of mapping the coarse matching result to the low-level features to perform calibration optimization on the dense matching is as follows:
1) Taking the N pairs of feature points on the correlated high-level fusion features of each image to be matched as centers, where the N pairs of feature points are the coarse matching point set screened out by coarse matching, and cropping N pairs of local windows of size m × m on the corresponding high-level fusion features;
2) Mapping the N pairs of windows onto the fine fusion features of the images to be matched to obtain N pairs of local fine windows centered on the coarse matching feature point pairs, inputting the N pairs of local fine windows into the feature transformation module, and transforming the windows several times to generate N pairs of local fine fusion feature maps of images A and B centered on the coarse matching feature point pairs;
3) Correlating the feature vector corresponding to the center point P of each local fine fusion feature map of image A with all vectors of the corresponding local fine fusion feature map of image B, generating the expected value of the matching probability distribution, with respect to P, of each pixel in the image-B window. The expected value of the matching probability distribution is calculated as
E(P) = Σ_x softmax(<V_A(P), V_B(x)>) · y
where V_A(P) is the feature vector of the center point P in the image-A window, V_B(x) is the feature vector of a pixel x in the image-B window, and y is the pixel coordinate of pixel x on image B. The pixel with the highest computed probability is the fine matching result on image B, with sub-pixel precision, of the point P on image A, and this result is taken as the final matching result.
Because the coarse matching result is obtained on the high-level fusion features, and the high-level feature description may contain matching errors for multi-source remote sensing images with large differences, the coarse matching result is relocated onto the fine fusion features and the best-matching fine feature points are selected, yielding a fine matching result for the multi-source remote sensing images with higher resolution and more precise positioning.
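The refinement of a single coarse match can be sketched as follows, assuming m × m local fine windows and taking the expectation of the softmax-normalized correlation map as the sub-pixel position; function and variable names are illustrative, not the patent's.

```python
import torch

def refine_match(win_a, win_b, center=None):
    """Correlate the center vector of the image-A fine window with every vector of the
    image-B fine window, form a probability map with a softmax, and take its expectation
    to obtain a sub-pixel position inside the image-B window."""
    # win_a, win_b: C x m x m local fine fusion feature windows
    c, m, _ = win_a.shape
    if center is None:
        center = (m // 2, m // 2)
    v_p = win_a[:, center[0], center[1]]               # feature vector of the center point P
    heat = win_b.reshape(c, -1).t() @ v_p              # correlation with every pixel of window B
    prob = torch.softmax(heat, dim=0).reshape(m, m)    # matching probability distribution
    ys, xs = torch.meshgrid(torch.arange(m, dtype=torch.float32),
                            torch.arange(m, dtype=torch.float32), indexing="ij")
    # expectation over pixel coordinates gives the sub-pixel offset inside the window
    y_hat = (prob * ys).sum()
    x_hat = (prob * xs).sum()
    return x_hat, y_hat, prob
```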
Furthermore, in order to guarantee the accuracy of the matching result, after the fine matching result is obtained, the PROSAC algorithm is used to check the fine matching result again and eliminate mismatches.
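For the PROSAC purification, a sketch using OpenCV's USAC_PROSAC robust estimator (available from OpenCV 4.5 onward) is shown below. Whether the patent fits a homography or an affine model as the geometric constraint is an assumption, and PROSAC expects the correspondences to be ordered by decreasing matching score.

```python
import cv2
import numpy as np

# Mismatch rejection sketch: fit a homography with PROSAC-style robust sampling.
# pts_a / pts_b stand in for the fine matching results, sorted by descending match score.
pts_a = np.random.rand(200, 2).astype(np.float32) * 512          # placeholder coordinates
pts_b = pts_a + np.random.randn(200, 2).astype(np.float32)       # placeholder coordinates

H, inlier_mask = cv2.findHomography(pts_a, pts_b, cv2.USAC_PROSAC, ransacReprojThreshold=3.0)
kept_a = pts_a[inlier_mask.ravel() == 1]                         # purified matches in image A
kept_b = pts_b[inlier_mask.ravel() == 1]                         # purified matches in image B
```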
Drawings
FIG. 1 is a schematic diagram of a feature extraction network according to embodiment 1 of the present invention;
FIG. 2 is a diagram illustrating a convolution block structure according to embodiment 1 of the present invention;
FIG. 3 is a schematic view of a characteristic connection structure of embodiment 1 of the method of the present invention;
FIG. 4a is a visualization of the H/8 × W/8 × 256 high-level feature map of embodiment 1 of the method of the present invention, and FIG. 4b is a visualization of the H/2 × W/2 × 128 fine fusion feature map of embodiment 1 of the method of the present invention;
FIG. 5 is a flowchart of a feature transformation process according to embodiment 1 of the present invention;
fig. 6a is the unmanned aerial vehicle optical image of the first image pair of a comparative example of the present invention, fig. 6b is the unmanned aerial vehicle thermal infrared image of the first image pair, fig. 6c is the ZY-3 panchromatic image of the second image pair, fig. 6d is the GF-3 SAR image of the second image pair, fig. 6e is the summer Google image of the third image pair, fig. 6f is the winter Google image of the third image pair, fig. 6g is the Google optical image of the fourth image pair, fig. 6h is the ZY-3 panchromatic image of the fourth image pair, fig. 6i is the Google optical image of the fifth image pair, fig. 6j is the GF-2 panchromatic image of the fifth image pair, fig. 6k is the Google optical image of the sixth image pair, and fig. 6l is the OSM raster map image of the sixth image pair of a comparative example of the present invention;
FIG. 7 is a schematic diagram of five-direction division according to a comparative example of the present invention;
fig. 8a is a schematic diagram of a fine matching result of a first group of image pairs according to a comparative example of the present invention, fig. 8b is a schematic diagram of a fine matching result of a second group of image pairs according to a comparative example of the present invention, fig. 8c is a schematic diagram of a fine matching result of a third group of image pairs according to a comparative example of the present invention, fig. 8d is a schematic diagram of a fine matching result of a fourth group of image pairs according to a comparative example of the present invention, fig. 8e is a schematic diagram of a fine matching result of a fifth group of image pairs according to a comparative example of the present invention, and fig. 8f is a schematic diagram of a fine matching result of a sixth group of image pairs according to a comparative example of the present invention;
fig. 9a is a schematic diagram of a purification result of a first group of image pairs according to a comparative example of the present invention, fig. 9b is a schematic diagram of a purification matching result of a second group of image pairs according to a comparative example of the present invention, fig. 9c is a schematic diagram of a purification result of a third group of image pairs according to a comparative example of the present invention, fig. 9d is a schematic diagram of a purification result of a fourth group of image pairs according to a comparative example of the present invention, fig. 9e is a schematic diagram of a purification result of a fifth group of image pairs according to a comparative example of the present invention, and fig. 9f is a schematic diagram of a purification result of a sixth group of image pairs according to a comparative example of the present invention;
fig. 10a is a matching effect of a LoFTR algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm, fig. 10b is a matching effect of a SuperPoint algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm, fig. 10c is a matching effect of a SIFT algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm, fig. 10d is a matching effect of a ContextDesc algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm, and fig. 10e is a matching effect of an FFM algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm;
fig. 11a is an enlarged view of a final registration result and a partial windowing of a first group of image pairs according to a comparative example of the present invention, fig. 11b is an enlarged view of a final registration result and a partial windowing of a second group of image pairs according to a comparative example of the present invention, fig. 11c is an enlarged view of a final registration result and a partial windowing of a third group of image pairs according to a comparative example of the present invention, fig. 11d is an enlarged view of a final registration result and a partial windowing of a fourth group of image pairs according to a comparative example of the present invention, fig. 11e is an enlarged view of a final registration result and a partial windowing of a fifth group of image pairs according to a comparative example of the present invention, and fig. 11f is an enlarged view of a final registration result and a partial windowing of a sixth group of image pairs according to a comparative example of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Method embodiment
The embodiment provides a multi-source remote sensing image depth feature fusion matching method, which comprises the following specific steps:
1) Constructing a matching model, wherein the matching model comprises a feature extraction network, a feature transformation module, a dense matching module and a calibration optimization module;
2) Inputting the acquired remote sensing image pair to be matched into the feature extraction network, which extracts features from each image to be matched separately to obtain each image's high-level features and fine fusion features that combine high-level and low-level information.
Referring to fig. 1, the structure of the feature extraction network is as follows:
(1) is the image input, which converts an input original image H × W into a feature image H × W × 1;
(2), (3) and (4) are the first, second and third down-sampling layers from the lower to the upper level: (2) extracts the 128-dimensional high-level features at 1/2 of the original image dimension (H/2 × W/2 × 128), (3) further extracts the 196-dimensional high-level features at 1/4 of the original dimension (H/4 × W/4 × 196), and (4) further extracts the 256-dimensional high-level features at 1/8 of the original dimension (H/8 × W/8 × 256);
the down-sampling in (2), (3) and (4) is a convolution operation on the original image. To reduce parameters and computation, (2), (3) and (4) all adopt a convolution block structure that can simply perform an identity mapping during forward propagation of the neural network, stacking the single-convolution result and the three-convolution result together to obtain the down-sampling result; the structure also does not affect back-propagation during training. As shown in fig. 2, the input feature map f1 and the output feature map f2 are superimposed; this simple connection improves the training effect of the model without adding extra parameters or computation to the network.
(5) and (8) are the bilinear interpolation (up-sampling) parts from higher to lower levels, and (6) and (9) are the feature connection parts. (5) produces the 196-dimensional low-level fine features at 1/4 of the original dimension (H/4 × W/4 × 196), and (6) connects and fuses the 196-dimensional 1/4-dimension high-level features extracted in (3) with the 196-dimensional 1/4-dimension low-level fine features obtained in (5), forming (7), the first up-sampling layer, which fuses the features containing bottom-level positioning information with the up-sampled high-level features rich in semantic information to obtain the 196-dimensional fused features at 1/4 of the original dimension. (8) then up-samples to produce 128-dimensional features at 1/2 of the original dimension, and (9) connects and fuses the 128-dimensional 1/2-dimension high-level features extracted in (2) with the 128-dimensional 1/2-dimension features obtained in (8), forming the second up-sampling layer, which fuses the features containing bottom-level positioning information with the up-sampled result of the first fusion again to obtain the 128-dimensional fine fusion features at 1/2 of the original dimension.
The feature extraction network finally outputs a high-level feature map at 1/8 of the original dimension and a fine fusion feature map at 1/2 of the original dimension for fusion feature matching. The final feature extraction result is obtained through down-sampling followed by refined up-sampling, with feature fusion at two different levels, which greatly improves the global context and positioning accuracy of the features.
The structure of the feature connection parts (6) and (9) is shown in FIG. 3. Taking (6) as an example, this part up-samples the high-level feature F3 (H/8 × W/8 × 256), which is richer in semantic information, to obtain F4 (H/4 × W/4 × 196), convolves the preceding-level feature F1 (H/4 × W/4 × 196) of F3 with a 1 × 1 convolution kernel to obtain F2 (H/4 × W/4 × 196), and then connects F4 to F2 to obtain the fused feature F5 (H/4 × W/4 × 196), which has both high positioning accuracy and global context. This connection adaptively adjusts the feature scale so that F2 and F4 accommodate the scale difference, removing the dependence on a single image scale. In addition, the operation fuses the positioning detail of the low-level features with the rich semantic information of the high-level features; the fused features greatly enhance the representation capability of multi-source remote sensing image features and resist the geometric and scale differences between multi-source remote sensing images.
In this embodiment, the images A and B to be matched are input into the feature extraction network, and after feature extraction and fusion the 256-dimensional high-level features F_A_C and F_B_C at 1/8 of the original image dimension and the 128-dimensional fine fusion features F_A_F and F_B_F at 1/2 of the original dimension are finally extracted. The feature map visualizations are shown in fig. 4, where fig. 4a is the H/8 × W/8 × 256 high-level feature map visualization and fig. 4b is the H/2 × W/2 × 128 fine fusion feature map visualization.
2) Inputting the high-level features of the images to be matched into the feature transformation module and performing feature transformation and fusion to obtain correlated high-level fusion features for each image to be matched.
Because the feature differences between multi-source remote sensing images are extremely large, and in particular the differences between homonymous features make their similarity low, the subsequent matching effect is seriously affected and the requirements of precise matching are hard to meet. The feature transformation module of this embodiment therefore processes the 256-dimensional, 1/8-dimension high-level features obtained in step 1) through the attention layers of a local feature transformer to obtain correlated high-level fusion features.
Before the feature transformation, sinusoidal codes are added to the 256-dimensional high-level features F_A_C and F_B_C obtained in step 1) to provide unique position information for every pixel, so that each feature corresponds uniquely to its pixel position on the original image and features at different levels have a uniquely determined correspondence. This allows the high-level features to be mapped onto the lower-level fine fusion features later, avoids the mismatching problem caused by overly high similarity between feature vectors in sparse texture regions, and improves the matching effect in sparse texture regions.
The position-encoded high-level features F_A_C and F_B_C are then flattened into one-dimensional vectors L1 and L2, each of length H/8 × W/8, and L1 and L2 are input into the feature transformation module. The feature transformation module improves the similarity between the high-level features of the two images by interleaving attention layers; the processing flow is shown in fig. 5.
Referring to fig. 5, the input vectors of an attention layer can be likened to the Query, Key and Value vectors of a dictionary lookup; the aim is to transform the features and further fuse the neighborhood information of the image itself and the feature information of the image to be matched. For the self-attention layer (Self Layer), Query = Key = Value = L1, which is equivalent to inputting the position-encoded features to fuse their own local neighborhood information and generate a new feature map; for the cross-attention layer (Cross Layer), Query = L1 and Key = Value = L2, so the position-encoded features of the reference image are fused with the features of the image to be matched. The attention weight W between features is computed through the dot product of Query and Key, and information is retrieved from Value. In this example, elu(Q)+1 and elu(K^T)+1 replace Query and Key to reduce the computation cost; furthermore, using the associativity of matrix multiplication, (elu(K^T)+1)·V is computed first to simplify the calculation. The final formula is as follows:
W(Q, K, V) = (elu(Q)+1) · [(elu(K^T)+1) · V]
The attention layer selects relevant information by measuring the similarity between the query vector and each key feature; the output vector is the sum of the value vectors weighted by the similarity scores, so relevant information can be extracted from value vectors with high similarity. In the attention layer, the linear attention result is normalized and then superimposed on the position-encoded L1 and L2 to obtain one-dimensional features F_A_tr and F_B_tr that fuse position information, neighborhood information, and information from the image to be matched. Because the normalization in the attention layer may lose information, the obtained fusion features F_A_tr and F_B_tr are fed back as L1 and L2 and processed again. Because the texture and gray-level differences between multi-source remote sensing images are large, the two sets of high-level features need a stronger correlation, so this embodiment increases the number of interleaving rounds to obtain more reliable, more strongly correlated fusion features; the value of N can be set according to the matching accuracy requirement, and in a preferred embodiment the interleaving number N = 8 gives the highest subsequent matching accuracy. Through this processing, the high-level fusion features F_A_tr and F_B_tr, which fuse the image's own neighborhood information and the information of the image to be matched, are output; after feature transformation, the correlation between the high-level fusion features of images A and B is stronger and the correlation between non-homonymous features is weaker, so the similarity of homonymous features is higher than that of other features and mismatches are reduced.
3) Carrying out dense matching on all feature vectors on the high-level fusion features, and obtaining a coarse matching result according to the similarity between the feature vectors;
in the step 2), the two high-level fusion features have stronger correlation after feature transformation fusion, and dense matching corresponding to the pixel level of the high-level fusion features is established on the high-level fusion features to obtain a rough dense matching result. In the embodiment, the initial matching, namely the rough matching, among the multi-source remote sensing images is performed through the optimal matching layer, and the key point of the rough matching is the similarity contrast among the features.
Because the output vector of the feature transformation fusion device is the sum of the value vectors weighted by the similarity scores and carries the similarity information, the similarity between all the feature vectors can be represented by a score matrix between the vectors, and if the similarity is greater than a score threshold, the similarity between the two vectors is high, so that the two vectors can be regarded as correct matching.
In this embodiment, the score matrix S between the high-level fusion features F_A_tr and F_B_tr is determined by
S(i, j) = <F_A_tr(i), F_B_tr(j)>, (i, j) ∈ A×B
where <·,·> denotes the inner product and A×B is the set of all possible pixel correspondences between the images A and B to be matched; for example, if image A has 640 × 480 pixels and image B has 500 × 300 pixels, there are 640 × 480 × 500 × 300 possible correspondences, i is one of the 640 × 480 pixels of image A, and j is one of the 500 × 300 pixels of image B.
The score matrix is computed for all possible matches, that is, for all possible pixel correspondences between the images A and B to be matched, and the optimal assignment matrix P is obtained by maximizing the total score Σ_i,j S(i,j) P(i,j). The optimal assignment matrix P represents the optimal correspondence between the high-level fusion feature vectors of image A and those of image B.
In this embodiment, the optimal assignment matrix P can be calculated with the entropy-regularization formulation of the optimal transport algorithm, and the optimal transport problem can be solved efficiently for P with the Sinkhorn algorithm. Finally, the mutual nearest neighbor criterion (MNN) is applied to filter potentially abnormal matches; combining the two selection criteria yields a relatively reliable and evenly distributed matching result.
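A sketch of the Sinkhorn-based assignment followed by mutual-nearest-neighbor filtering is shown below, assuming a log-domain implementation without the extra "dustbin" row and column some matchers add for unmatched points; the iteration count and regularization strength are illustrative.

```python
import torch

def sinkhorn_assignment(scores, n_iters=20, eps=0.1):
    """Entropy-regularized, optimal-transport style normalization of the score matrix via
    Sinkhorn iterations, then mutual-nearest-neighbor (MNN) filtering of the result."""
    log_p = scores / eps                                               # log-domain potentials
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)    # row normalization
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)    # column normalization
    p = log_p.exp()                                                    # (approximately) doubly stochastic assignment
    row_best = p.argmax(dim=1)
    col_best = p.argmax(dim=0)
    rows = torch.arange(p.shape[0])
    mutual = col_best[row_best] == rows                                # MNN criterion
    return p, torch.stack([rows[mutual], row_best[mutual]], dim=1)
```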
Because the number of matching point pairs in sparse or uniform texture regions is small during multi-source remote sensing image matching, not enough high-score matching pairs can be selected there, so the features of those regions participate insufficiently in matching and an ideal matching effect is hard to achieve. A sliding-window adaptive score threshold detection algorithm is therefore established to increase the coarse matching results in sparse texture regions, as follows:
First, an initial score threshold θ is set, together with the sliding window area and the horizontal and vertical sliding steps, and the scores of the correlated high-level fusion feature vectors are detected as the window slides. If any feature vector in the current window has a score s greater than or equal to θ, the window simply continues to slide.
When the scores s of all high-level fusion feature vectors in the current window are less than θ, the adaptive threshold avgθ of this matching-sparse window is computed; the high-level fusion feature vectors in the window are traversed, every vector whose score s exceeds avgθ is added to the coarse matching point set, and the window then continues to slide.
These operations are repeated until the window has traversed the high-level fusion features F_A_tr and F_B_tr, and the coarse matching point set of F_A_tr and F_B_tr is output.
Wherein, the calculation formulas of the area of the sliding window, the horizontal sliding step length and the vertical sliding step length are as follows
Figure BDA0003731006350000122
Figure BDA0003731006350000131
Figure BDA0003731006350000132
Wherein ws is the area of the sliding window, hl is the horizontal sliding step length, and vl is the vertical sliding step length;
The adaptive threshold avgθ of a matching-sparse window is calculated as
avgθ = (1/n) Σ_{i=1..n} s_i
where n is the number of feature vectors in the sliding window and s_i is the matching score of the i-th feature vector in the window.
Because the number of matching point pairs in sparse or uniform texture regions is small during multi-source remote sensing image matching, the number of high-score matching pairs there may be lower than in densely matched regions; adaptively lowering the score threshold in matching-sparse regions screens in additional lower-score matching pairs and supplements the matching data. At the same time, because the threshold is lowered adaptively according to the feature sparsity of the region, redundant low-score pairs in densely matched regions are not selected and errors are avoided.
In this embodiment, the outdoor scenes of the MegaDepth dataset are also used as the training set for the coarse matching result. MegaDepth is a large depth dataset for monocular depth estimation generated from a large number of internet pictures; it contains about one hundred thousand outdoor three-dimensional scenes, from which stereo pairs with strict transformation relations and their camera parameters can be generated, and the image points of the stereo pairs have a one-to-one pixel matching relation. The true matches of the actual scenes in the training set are computed through this correspondence and taken as the ground-truth assignment matrix; combining it with the assignment matrix value that represents the coarse matching result and minimizing the difference between the two yields the best coarse matches, which finally improves the stability and reliability of coarse matching. According to this principle, the loss function measures the difference between the computed assignment matrix value and the ground-truth assignment matrix. The purpose of training is to make the coarse matching result continuously approach the known true matches, mainly training on data with ground-truth matching information such as illumination changes, large scale differences, and day-night image pairs, and learning the true matching relation.
4) Mapping the coarse matching result onto the fine fusion features to calibrate and optimize the dense matching and obtain the fine matching result.
Because the coarse matching result is obtained at 1/8 of the original image resolution, its position may be in error when mapped back to the original dimension: two high-level descriptors may be extremely similar without being the most similar, and an error of several pixels can exist within a local window. For example, a coarse match obtained at the high-level feature resolution corresponds to matching between feature vectors extracted from 8 × 8 pixel regions and cannot be located precisely at the pixel level, and the high-level feature description may itself be in error for multi-source remote sensing images with large differences. Therefore, the feature points obtained by coarse matching in step 3) are located in the fine fusion features obtained in step 1) for calibration and optimization, realizing further fine matching and producing a higher-resolution fine matching result for the multi-source remote sensing images.
Specifically, during fine matching, taking the N feature point pairs of the coarse matching point set screened in step 3) as centers, N pairs of local windows of size m × m are cropped on the high-level fusion features F_A_tr and F_B_tr respectively; the N pairs of local windows are mapped correspondingly onto the fine fusion features F_A_F and F_B_F of the images A and B to be matched to obtain N pairs of local fine windows centered on the coarse matching feature point pairs; the N pairs of local fine windows are input into the feature transformation module and transformed several times to generate N pairs of local fine fusion feature maps of images A and B centered on the coarse matching feature point pairs.
The value of m can be set according to actual needs: if running time matters most, m is set to about 5; if the final precision of the matching result matters most, m is set to about 8; if the hardware is limited, m is set to about 3 so that the memory footprint is small. In this embodiment m is set to 5.
Then the feature vector corresponding to the center point P of each local fine fusion feature map of image A is correlated with all vectors of the corresponding local fine fusion feature map of image B, generating the expected value of the matching probability distribution, with respect to P, of each pixel in the image-B window. The expected value of the matching probability distribution is calculated as
E(P) = Σ_x softmax(<V_A(P), V_B(x)>) · y
where V_A(P) is the feature vector of the center point P in the image-A window, V_B(x) is the feature vector of a pixel x in the image-B window, and y is the pixel coordinate of pixel x on image B. The pixel with the highest computed probability is the fine matching result on image B, with sub-pixel precision, of the point P on image A, and this result is taken as the final matching result.
To guarantee the accuracy of the matching result, the fine matching result must again be checked for mismatches and purified. This method uses the progressive sample consensus algorithm, i.e. the PROSAC algorithm, to eliminate mismatches. PROSAC samples from a progressively enlarged set of best matching point pairs; although the algorithm can become very unstable when affected by too many mismatched points, the combined processing of coarse matching and fine matching calibration leaves very few mismatches, so using PROSAC to eliminate the remaining mismatches is well suited here.
Comparative example
In this comparative example, the multi-source remote sensing image depth feature fusion matching method of the method embodiment (hereinafter referred to as the FFM algorithm) is tested under the Ubuntu 18.04 operating system; the programming language environment is Python 3.6 and the programming platform is PyCharm. The hardware platform is a notebook computer with an i7 CPU, 31 GB of memory, and a GeForce RTX 2060 graphics card (6 GB of video memory).
Six pairs of multi-source remote sensing images are selected for testing in this comparative example: the first group is an unmanned aerial vehicle optical image and an unmanned aerial vehicle thermal infrared image, the second group is a ZY-3 panchromatic image and a GF-3 SAR image, the third group is a summer Google image and a winter Google image, the fourth group is a Google optical image and a ZY-3 panchromatic image, the fifth group is a Google optical image and a GF-2 panchromatic image, and the sixth group is a Google optical image and an OSM raster map image. The specific images are shown in figs. 6a-6l, where fig. 6a is the unmanned aerial vehicle optical image of the first group, fig. 6b is the unmanned aerial vehicle thermal infrared image of the first group, fig. 6c is the ZY-3 panchromatic image of the second group, fig. 6d is the GF-3 SAR image of the second group, fig. 6e is the summer Google image of the third group, fig. 6f is the winter Google image of the third group, fig. 6g is the Google optical image of the fourth group, fig. 6h is the ZY-3 panchromatic image of the fourth group, fig. 6i is the Google optical image of the fifth group, fig. 6j is the GF-2 panchromatic image of the fifth group, fig. 6k is the Google optical image of the sixth group, and fig. 6l is the OSM raster map image of the sixth group.
The comparative analysis of six groups of multi-source remote sensing image data is shown in table 1:
TABLE 1 comparative analysis of test data
Figure BDA0003731006350000151
Figure BDA0003731006350000161
The performance of the matching algorithms is evaluated with the number of correct matching points (P), the matching accuracy (MA), the root mean square error (RMSE) of the matching points, and the matching time (t). Since the matching algorithm of this comparative example focuses on obtaining a more uniform matching result, the uniformity of the distribution of the matching points (RSD) is also used to measure how evenly the matching results are distributed.
A matching point is counted as correct if the transformed position of the feature point on the image to be matched differs from the position of its corresponding point on the reference image by less than a threshold, which is checked with the following verification formula:

$\left\| H(x_i, y_i) - (x_i', y_i') \right\| < \varepsilon$

where H is an affine transformation model fitted from manually selected points, used in place of the true transformation between the two multi-source remote sensing images; if the distance between the feature point $(x_i, y_i)$ after affine transformation and its corresponding (same-name) point $(x_i', y_i')$ is less than the threshold $\varepsilon$, the pair is judged to be a correct match. In this comparative example the threshold is set to 3. The number of correct matching points (P) is the number of matching points satisfying the above condition; this index reflects the basic performance of a feature matching algorithm.
The matching accuracy (MA) is the ratio of the number of correct matching points to the total number of matching points; this index reflects how reliably the algorithm produces successful matches.
The root mean square error (RMSE) of the matching points is the square root of the mean squared difference between the affine-transformed positions $x_i'$ of the correct matching points and their true positions $x_i$, computed over the n correct matches:

$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left\| x_i' - x_i \right\|^{2}}$
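For illustration, the number of correct matches P, the matching accuracy MA and the matching-point RMSE can be computed along the following lines; this is a minimal sketch assuming the matched points and the manually fitted 2×3 affine model H are already available as NumPy arrays, and the function and variable names are illustrative only:

```python
import numpy as np

def evaluate_matches(pts_a, pts_b, H, eps=3.0):
    """Evaluate matches between an image to be matched (A) and a reference image (B).

    pts_a, pts_b: (N, 2) matched pixel coordinates on A and B.
    H: 2x3 affine matrix fitted from manually selected control points.
    eps: distance threshold in pixels (3 in the comparative example).
    """
    ones = np.ones((len(pts_a), 1))
    projected = np.hstack([pts_a, ones]) @ H.T          # affine-transform the points of A
    errors = np.linalg.norm(projected - pts_b, axis=1)  # distance to the same-name points
    correct = errors < eps

    P = int(correct.sum())                              # number of correct matching points
    MA = P / len(pts_a) if len(pts_a) else 0.0          # matching accuracy
    RMSE = float(np.sqrt(np.mean(errors[correct] ** 2))) if P else float("inf")
    return P, MA, RMSE
```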
The distribution uniformity of the matching points is calculated from the uniformity of the matching results along five directions; the image is divided into ten regions in total along these five directions, as shown in fig. 7. Matching-point error estimation is an important index for measuring the matching effect: the root mean square error responds strongly to unusually large or small errors within a set of transformations, so it reflects the accuracy of multi-source remote sensing image matching results well. Moreover, the reference values used for the matching-point RMSE are true pixel coordinates, which carry no bias, making the RMSE particularly suitable for estimating matching-point error.
Following statistical principles, the sample variance is used to represent the difference in the number of matching points among the image blocks in the five directions: if the matching point pairs are distributed relatively uniformly across the five directions, the sample variance of their counts is small; otherwise it is large. The distribution uniformity of the matching points is given by:
(The distribution-uniformity formula is provided as an image in the original publication; it is a logarithmic function of the sample variance of the region statistical distribution vector V defined below.)
In the formula, V is the region statistical distribution vector, formed by concatenating the numbers of matching points in the ten regions. A larger distribution-uniformity value indicates a more uniform distribution of the matching points; a smaller value indicates a less uniform distribution.
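Since the exact uniformity formula appears only as an image in the original publication, the following sketch merely assumes a plausible form consistent with the description, i.e. a logarithmic function of the sample variance of the ten region counts; both the assumed formula and the input convention are illustrative assumptions:

```python
import numpy as np

def distribution_uniformity(region_counts):
    """Assumed uniformity score: larger when the ten region counts are more even.

    region_counts: length-10 sequence with the number of matching points falling
    into each of the ten regions (two regions per direction, five directions).
    """
    V = np.asarray(region_counts, dtype=float)
    var = V.var(ddof=1)                        # sample variance of the region counts
    return float(np.log(1.0 / (var + 1e-6)))  # assumed log-of-inverse-variance form
```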
Several representative algorithms applicable to multi-source remote sensing image matching are selected for comparison and analysis: the deep-learning-based SuperPoint, ContextDesc and LoFTR algorithms, and the classic SIFT algorithm. SuperPoint is a self-supervised deep learning algorithm for extracting feature points and descriptors. ContextDesc is a deep learning matching algorithm designed specifically for multi-modal images; it enhances an original feature descriptor such as DELF with high-level visual information of the image and geometric information of the keypoint distribution. SIFT, the scale-invariant feature transform, is a local feature descriptor with a degree of affine invariance and interference resistance. These four algorithms and the FFM algorithm of this comparative example are applied to the above six image pairs for matching tests; the matching test results are compared in Table 2.
TABLE 2 comparison of matching test results
(The body of Table 2 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 2, the FFM algorithm achieves good matching results on all six pairs of multi-source remote sensing images and obtains a sufficient number of correct matching points in a competitive time.
Comparison within Table 2 shows that, for multi-source remote sensing image pairs of different modalities, the FFM algorithm obtains a larger number of correct matching points; it is ahead of or behind the LoFTR algorithm depending on the image pair, but its counts are far higher than those of the other three algorithms. Because multi-source remote sensing images exhibit large gray-level differences and inconsistent local gradient information at keypoints, the SIFT algorithm fails to match the visible-light image with the thermal infrared image, the panchromatic image with the SAR image, and the optical image with the raster map. Compared with SIFT, the FFM algorithm is more stable when matching multi-source remote sensing images with large gray-level differences and inconsistent local gradient information. The SuperPoint algorithm improves considerably on SIFT and ContextDesc in the number of correct matching points, matching-point RMSE and time, indicating stronger adaptability to multi-source remote sensing images, but its performance is generally below that of the FFM algorithm. The ContextDesc algorithm fuses multiple features for matching, yet its matching results are relatively poor, indicating insufficient adaptability to multi-source remote sensing images with larger differences; it fails to match the panchromatic image with the SAR image and the optical image with the raster map, which shows poor resistance to the nonlinear radiometric distortion and local gradient differences of multi-source remote sensing images.
In terms of matching-point RMSE, the LoFTR and SuperPoint algorithms perform well compared with SIFT and ContextDesc, but still fall short of the FFM algorithm, particularly on the panchromatic and SAR images; this comparison shows that the feature localization accuracy of the FFM algorithm is higher. In terms of time, the FFM algorithm is slower than SuperPoint and LoFTR because of the two additional stages of sliding-window search detection and matching verification applied to the initial matches. The RMSE values obtained by the FFM algorithm on the six image groups differ somewhat: compared with the third group, the first and second groups contain more buildings, and because buildings have different projection parallaxes on different images, the building regions exhibit larger local deformation between images; such local geometric deformation is difficult to remove with an affine transformation model, so the RMSE of those matching results is relatively large.
For the distribution uniformity of the matching points, the comparison focuses on the FFM and LoFTR algorithms, since the other algorithms produce too few correct matching points; the comparison results are shown in Table 3.
TABLE 3 comparison of uniformity of distribution of matching points
(The body of Table 3 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 3, the matching-point uniformity of the FFM algorithm is greater than that of LoFTR on all six multi-source remote sensing image pairs. Since the uniformity measure applies a logarithmic operation and thus reflects the variance of the five-direction distribution, the higher values show that the matching-point uniformity of FFM is clearly superior to that of LoFTR, and the experiment confirms the effectiveness of the sliding-window adaptive score detection algorithm in detecting features in sparsely matched regions. After the fine matching is completed in this comparative example, the matching results are shown in figs. 8a-8f, where fig. 8a shows the fine matching result of the first image pair, fig. 8b of the second, fig. 8c of the third, fig. 8d of the fourth, fig. 8e of the fifth, and fig. 8f of the sixth.
The FFM algorithm therefore adapts well to multi-source remote sensing images, obtains a considerable number of matching point pairs, and distributes the feature points uniformly. The small number of remaining mismatched points is removed with the PROSAC algorithm to purify the matching point pairs; the purification results are shown in figs. 9a-9f, where fig. 9a shows the purification result of the first image pair, fig. 9b of the second, fig. 9c of the third, fig. 9d of the fourth, fig. 9e of the fifth, and fig. 9f of the sixth.
The PROSAC algorithm thus eliminates mismatched point pairs effectively, and the matching point pairs finally retained remain largely uniformly distributed, laying a good foundation for subsequent image registration, fusion and related work.
To show the performance of the algorithms more intuitively, figs. 10a-10e show the matching results of the LoFTR, SuperPoint, SIFT, ContextDesc and FFM algorithms on the first, third and fourth multi-source remote sensing image pairs after purification with the PROSAC algorithm: fig. 10a corresponds to LoFTR, fig. 10b to SuperPoint, fig. 10c to SIFT, fig. 10d to ContextDesc, and fig. 10e to FFM.
The results in figs. 10a-10e show that, for the optical/thermal-infrared and optical/panchromatic image pairs, FFM overcomes the matching difficulties caused by gray-gradient and scale differences better than SuperPoint, ContextDesc and SIFT, and obtains a considerable number of correct matching point pairs. For optical images of different time phases, FFM has a clear advantage in areas with vegetation differences, benefiting from learning the ground-truth relationships among feature vectors during training.
Multi-source remote sensing image registration is one of the main purposes of image matching, so the quality of the registration result is an intuitive indicator of the quality of the matching result. The purified matching point pairs are used in a multi-source image registration test: affine transformation parameters are computed from the matching point pairs to rectify and register the multi-source images. The final registration results, with locally windowed enlargements, are shown in figs. 11a-11f, where fig. 11a shows the result for the first image pair, fig. 11b for the second, fig. 11c for the third, fig. 11d for the fourth, fig. 11e for the fifth, and fig. 11f for the sixth.
As can be seen from figs. 11a-11f, the FFM algorithm adapts well to the registration of visible-light with thermal infrared images and of panchromatic with SAR images. Accurate registration is achieved even in local areas with large gray-level differences and obvious differences in ground objects, and the registration error of each area is essentially kept within 3 pixels. The registration results show that the matches produced by the FFM algorithm have high positional accuracy and uniform distribution, demonstrating strong performance.
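As a rough illustration of this registration step, the purified matching point pairs can be used to estimate affine parameters and resample one image onto the other, for example with OpenCV; this is a minimal sketch under assumed function and parameter choices, not the exact implementation used in the test:

```python
import cv2
import numpy as np

def register_with_affine(img_to_warp, ref_img, pts_src, pts_ref):
    """Estimate an affine model from purified matches and warp one image onto the other.

    pts_src: (N, 2) matched points on the image to be warped.
    pts_ref: (N, 2) corresponding points on the reference image.
    """
    # Robustly fit the 2x3 affine transformation from the matched point pairs.
    A, inliers = cv2.estimateAffine2D(pts_src.astype(np.float32),
                                      pts_ref.astype(np.float32),
                                      method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    # Resample the source image into the reference image's pixel grid.
    h, w = ref_img.shape[:2]
    registered = cv2.warpAffine(img_to_warp, A, (w, h))
    return registered, A
```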
In the method, during the early feature extraction stage the fine fusion features obtained through feature extraction and fusion carry both high-level and low-level information, which guarantees localization accuracy as well as global awareness and interference resistance; in addition, feature transformation fusion is applied to the extracted high-level features, which raises the similarity between matching points in the high-level features of the images to be matched and makes the matching result more reliable. In the later feature matching stage, the high-level fusion features are first compared in a coarse matching step to obtain a result that better reflects the global characteristics, and the fine fusion features are then compared to correct that result, so the matching result is reliable both globally and in terms of precision, with higher resolution and more accurate localization. Before feature transformation fusion, sinusoidal encoding is applied to all high-level feature vectors so that features at different positions have a unique, determined correspondence, avoiding mismatches caused by excessive similarity between feature vectors in sparsely textured image regions; during coarse matching, the sliding window adaptively lowers the score threshold, which increases the number of matching points retained in sparse regions, so the matching quality of sparse regions is improved from both aspects.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (9)

1. A multi-source remote sensing image depth feature fusion matching method is characterized by comprising the following steps:
1) Constructing a matching model, wherein the matching model comprises a feature extraction network, a feature transformation module, a dense matching module and a calibration optimization module;
2) Inputting the obtained remote sensing image pairs to be matched into a feature extraction network, the feature extraction network extracting features from each image to be matched respectively to obtain the high-level features of each image to be matched and fine fusion features that fuse the fine positioning information of the high-level features with the global information of the low-level features;
3) Simultaneously inputting the high-level features of the obtained image pair to be matched into a feature transformation module, and performing feature transformation fusion on each image to be matched by fusing neighborhood information of the image and the high-level features of the image to be matched to obtain the high-level fusion features with correlation of each image to be matched in the image pair to be matched;
4) Carrying out dense matching on all the feature vectors on the high-level fusion features of the obtained image pair to be matched, and obtaining a coarse matching result according to the similarity between the feature vectors;
5) And mapping the coarse matching result to the fine fusion characteristic to carry out calibration optimization on the dense matching to obtain a fine matching result.
2. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein the feature extraction network comprises three down-sampling layers and two up-sampling layers;
the first down-sampling layer is used for down-sampling the input image to be matched to obtain a high-level feature map with the original dimension of 1/2 of the image to be matched; the second down-sampling layer is used for down-sampling the input high-level feature map of the original dimension 1/2 of the image to be matched to obtain the high-level feature map of the original dimension 1/4 of the image to be matched; the third down-sampling layer is used for down-sampling the input high-level feature map with the original dimensionality of 1/4 of the image to be matched to obtain the high-level feature map with the original dimensionality of 1/8 of the image to be matched;
the first up-sampling layer is used for up-sampling an input high-level feature map with original dimension 1/8 of an image to be matched into a low-level feature map with original dimension 1/4 of the image to be matched, meanwhile, performing convolution processing on the high-level feature map with original dimension 1/4, and then fusing the low-level feature map with original dimension 1/4 and the high-level feature map with original dimension 1/4 to obtain a fused feature map with original dimension 1/4; the second upsampling layer is used for upsampling the input fusion feature map with the original dimension of 1/4 into a low-level feature map with the original dimension of 1/2, simultaneously performing convolution processing on the high-level feature map with the original dimension of 1/2, and then fusing the low-level feature map with the original dimension of 1/2 and the high-level feature map with the original dimension of 1/2 to obtain a fine fusion feature map with the original dimension of 1/2;
and outputting a high-level feature map of the original dimension 1/8 and a fine fusion feature map of the original dimension 1/2 of each image to be matched through the three down-sampling layers and the two up-sampling layers for fusion feature matching.
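A minimal PyTorch sketch of a feature extraction network of this shape is given below, assuming single-channel input, simple strided-convolution blocks for down-sampling, and bilinear up-sampling with 1×1 convolutions for fusion; the channel widths are illustrative assumptions and the residual (element superposition) structure of claim 4 is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Three down-sampling layers (1/2, 1/4, 1/8) and two up-sampling layers (1/4, 1/2)."""
    def __init__(self, c1=64, c2=128, c3=256):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(1, c1, 3, stride=2, padding=1), nn.ReLU())   # -> 1/2
        self.down2 = nn.Sequential(nn.Conv2d(c1, c2, 3, stride=2, padding=1), nn.ReLU())  # -> 1/4
        self.down3 = nn.Sequential(nn.Conv2d(c2, c3, 3, stride=2, padding=1), nn.ReLU())  # -> 1/8
        self.lat2 = nn.Conv2d(c2, c2, 1)      # convolution applied to the 1/4 high-level map
        self.lat1 = nn.Conv2d(c1, c1, 1)      # convolution applied to the 1/2 high-level map
        self.proj3to2 = nn.Conv2d(c3, c2, 1)  # channel projections before up-sampling
        self.proj2to1 = nn.Conv2d(c2, c1, 1)

    def forward(self, x):
        f2 = self.down1(x)        # 1/2 high-level feature map
        f4 = self.down2(f2)       # 1/4 high-level feature map
        f8 = self.down3(f4)       # 1/8 high-level feature map (coarse features)
        # First up-sampling layer: 1/8 -> 1/4, fused with the convolved 1/4 map.
        u4 = F.interpolate(self.proj3to2(f8), scale_factor=2, mode="bilinear", align_corners=False)
        fused4 = u4 + self.lat2(f4)
        # Second up-sampling layer: 1/4 -> 1/2, fused with the convolved 1/2 map.
        u2 = F.interpolate(self.proj2to1(fused4), scale_factor=2, mode="bilinear", align_corners=False)
        fine2 = u2 + self.lat1(f2)            # fine fusion feature map at 1/2 resolution
        return f8, fine2
```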
3. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein the fusion feature matching of the feature transformation module specifically comprises the following steps:
(1) adding position information to a high-level feature map with 1/8 of the original dimension of the image to be matched, and enabling the feature to be uniquely corresponding to the position of the feature on the original image;
(2) flattening a high-level feature map with position information of an image to be matched into a one-dimensional vector, inputting the one-dimensional vector to a feature transformation module, performing multiple interweaving processing through a concerned layer, and outputting high-level fusion features with correlation of each image to be matched;
the attention layers comprise a self-attention layer, a cross-attention layer and an attention layer; the self-attention layer fuses, for the reference image in the input image pair to be matched, the position-encoded features with their own local neighborhood information to generate a new feature map; the cross-attention layer fuses the position-encoded features of the reference image with the features of the other image to be matched in the image pair; the attention layer selects the feature information of the vectors with high similarity by comparing the similarity between the input query vector and each key feature, and the selected result, after normalization, is superimposed on the flattened one-dimensional vector of the position-encoded high-level feature map of each image to be matched, to obtain fusion features that fuse the position information, neighborhood information and image information of the image to be matched; and the multiple interweaving processing means that the obtained fusion features are input to the attention layer for interweaving processing again, the process is repeated, and the high-level fusion features of the images to be matched are finally output.
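The interweaved attention processing can be sketched as follows, assuming standard scaled dot-product attention over the flattened, position-encoded feature vectors; the layer count, dimensions and weight sharing between the two images are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Scaled dot-product attention whose normalized output is added back to the query features."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, source_feats):
        # Select information from source_feats according to similarity with query_feats,
        # normalize it, and superimpose it on the flattened query feature vectors.
        selected, _ = self.attn(query_feats, source_feats, source_feats)
        return query_feats + self.norm(selected)

class FeatureTransformer(nn.Module):
    """Alternating self- and cross-attention producing correlated high-level fusion features."""
    def __init__(self, dim=256, num_rounds=4):
        super().__init__()
        self.self_layers = nn.ModuleList(AttentionLayer(dim) for _ in range(num_rounds))
        self.cross_layers = nn.ModuleList(AttentionLayer(dim) for _ in range(num_rounds))

    def forward(self, feats_a, feats_b):
        for self_l, cross_l in zip(self.self_layers, self.cross_layers):
            feats_a = self_l(feats_a, feats_a)   # self-attention: fuse local neighborhood info
            feats_b = self_l(feats_b, feats_b)
            feats_a = cross_l(feats_a, feats_b)  # cross-attention: fuse info from the other image
            feats_b = cross_l(feats_b, feats_a)
        return feats_a, feats_b
```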
4. The multi-source remote sensing image depth feature fusion matching method according to claim 2, wherein the three downsampling layers all adopt a convolution block structure, element superposition is carried out on the input feature map and the output feature map, and the feature map obtained after superposition is used as a downsampling result.
5. The multi-source remote sensing image depth feature fusion matching method according to claim 3, wherein the adding position information is adding sinusoidal codes to each pixel feature.
6. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein the coarse matching represents the similarity between all feature vectors through a score matrix between the high-level fusion feature vectors, and a pair whose similarity is greater than a score threshold is regarded as a correct match;
the score matrix S between the vectors is determined by the following formula, where $\langle \cdot,\cdot \rangle$ denotes the inner product:
$S(i, j) = \langle F_{A\_tr}(i),\, F_{B\_tr}(j) \rangle,\ \forall (i, j) \in A \times B$
wherein $F_{A\_tr}$ and $F_{B\_tr}$ are the high-level fusion features of the image A and the image B to be matched, and $A \times B$ denotes all possible correspondences between pixels of the images A and B to be matched;
a score matrix is calculated for all possible matching modes, and the optimal assignment matrix P is obtained by maximizing the total score $\sum_{i,j} S_{i,j} P_{i,j}$; the optimal assignment matrix P represents the optimal correspondence between the high-level fusion feature vectors in the image A to be matched and those in the image B to be matched.
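A minimal sketch of this coarse-matching score computation is shown below; mutual-nearest-neighbour selection is used here merely as one simple way of approximating the assignment that maximizes the total score, and the threshold value is an illustrative assumption:

```python
import torch

def coarse_match(feat_a, feat_b, score_threshold=0.2):
    """feat_a: (Na, d) and feat_b: (Nb, d) high-level fusion feature vectors."""
    S = feat_a @ feat_b.t()              # score matrix of inner products over A x B
    # Keep pairs that are mutual best matches and exceed the score threshold.
    best_b = S.argmax(dim=1)             # best column for each row of A
    best_a = S.argmax(dim=0)             # best row for each column of B
    rows = torch.arange(S.shape[0])
    mutual = best_a[best_b] == rows
    keep = mutual & (S[rows, best_b] > score_threshold)
    return rows[keep], best_b[keep], S
```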
7. The multi-source remote sensing image depth feature fusion matching method according to claim 6, wherein the coarse matching through the score matrix adopts a sliding window adaptive score threshold detection algorithm, and specifically comprises the following steps:
a) Setting an initial score threshold value as theta, setting the area of a sliding window, a horizontal sliding step length and a vertical sliding step length, and performing sliding detection on the score of the high-level fusion feature vector;
b) If the scores s of all high-level fusion feature vectors in the current window are less than theta, calculating an adaptive threshold avg theta for the matched-sparse region within the window; traversing the high-level fusion feature vectors in the window, and if a vector in the current window has a score s greater than avg theta, adding the vector to the coarse matching point set and continuing to slide the window;
c) If the score s of the feature vector existing in the current window is larger than or equal to theta, the window continues to slide;
d) Repeating the steps until the window slides and traverses the high-level fusion features with correlation of the images to be matched, and outputting a coarse matching point set;
the area ws of the sliding window, the horizontal sliding step length hl and the vertical sliding step length vl are calculated by formulas that are provided as images in the original publication;
the calculation formula of the adaptive threshold avg theta for the matched-sparse region is as follows:
$avg\theta = \dfrac{1}{n}\sum_{i=1}^{n} s_i$
wherein n is the number of feature vectors in the sliding window and $s_i$ is the matching score of the i-th feature vector in the sliding window.
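The sliding-window adaptive score threshold detection can be sketched as follows; the window geometry (ws, hl, vl) is taken as given, ws is treated as a window side length, and the adaptive threshold in a matched-sparse window is taken as the mean score of the feature vectors in that window, all of which are simplifying assumptions for illustration:

```python
import numpy as np

def sliding_window_detect(score_map, theta, ws, hl, vl):
    """Adaptive score-threshold detection on a 2-D grid of matching scores.

    score_map: (H, W) best matching score of each high-level fusion feature vector.
    theta: initial score threshold; ws: window size; hl, vl: horizontal/vertical steps.
    """
    H, W = score_map.shape
    keep = np.zeros_like(score_map, dtype=bool)
    for top in range(0, max(H - ws + 1, 1), vl):
        for left in range(0, max(W - ws + 1, 1), hl):
            window = score_map[top:top + ws, left:left + ws]
            if (window >= theta).any():
                # At least one score reaches theta: keep those points, keep sliding.
                keep[top:top + ws, left:left + ws] |= window >= theta
            else:
                # Matched-sparse window: lower the threshold to the window's mean score.
                avg_theta = window.mean()
                keep[top:top + ws, left:left + ws] |= window > avg_theta
    return np.argwhere(keep)   # coarse matching point set as (row, col) indices
```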
8. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein the process of mapping the coarse matching result to the low-level features to perform calibration optimization on the dense matching is as follows:
i) Taking N pairs of feature points on the high-level fusion features of each image to be matched as a center, wherein the N pairs of feature points refer to a coarse matching point set screened out after coarse matching; respectively cutting N pairs of local windows with the size of m multiplied by m on the corresponding high-level fusion characteristics;
II) mapping the N pairs of windows onto the fine fusion features of the images to be matched to obtain N pairs of local fine windows centered on the coarse-matching feature point pairs, inputting the N pairs of local fine windows into the feature transformation module, and transforming them several times to generate N pairs of local fine fusion feature maps $\hat F_A$ and $\hat F_B$ of the images A and B centered on the coarse-matching feature point pairs;
III) correlating the feature vector corresponding to the center point P of each $\hat F_A$ with all vectors in the corresponding $\hat F_B$ to generate the expected value of the matching probability distribution of each pixel of $\hat F_B$ with respect to P; the expected value of the matching probability distribution is calculated as follows:
$\hat y_P = \sum_{x \in \hat F_B} \operatorname{softmax}\big(\langle V_A(P),\, V_B(x) \rangle\big)_x \cdot y_x$
wherein $V_A(P)$ is the feature vector of the center point P of $\hat F_A$, $V_B(x)$ is the feature vector of a pixel point x of $\hat F_B$, and $y_x$ is the pixel coordinate of the pixel point x on the image B; the pixel point with the highest calculated probability value is the fine matching result, with sub-pixel precision, of the point P of image A on image B, and this result is taken as the final matching result.
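The calibration optimization on the fine fusion features can be sketched as follows, assuming a softmax-weighted correlation followed by an expectation over pixel coordinates; since the exact formula appears only as an image in the original publication, this form is an assumption for illustration:

```python
import torch

def fine_refine(center_vec, local_window_feats, coords):
    """Sub-pixel refinement of one coarse match.

    center_vec: (d,) feature vector of the center point P on the local window of image A.
    local_window_feats: (m*m, d) feature vectors of the local fine window on image B.
    coords: (m*m, 2) float pixel coordinates of those vectors on image B.
    """
    sim = local_window_feats @ center_vec               # correlation with every pixel in the window
    prob = torch.softmax(sim, dim=0)                    # matching probability distribution
    expected = (prob.unsqueeze(1) * coords).sum(dim=0)  # expected (sub-pixel) position on image B
    return expected
```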
9. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein after the fine matching result is obtained, a PROSAC algorithm is further adopted to perform mismatching check and elimination on the fine matching result again.
CN202210792899.XA 2022-07-05 2022-07-05 Multi-source remote sensing image depth feature fusion matching method Pending CN115240079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210792899.XA CN115240079A (en) 2022-07-05 2022-07-05 Multi-source remote sensing image depth feature fusion matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210792899.XA CN115240079A (en) 2022-07-05 2022-07-05 Multi-source remote sensing image depth feature fusion matching method

Publications (1)

Publication Number Publication Date
CN115240079A true CN115240079A (en) 2022-10-25

Family

ID=83671142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210792899.XA Pending CN115240079A (en) 2022-07-05 2022-07-05 Multi-source remote sensing image depth feature fusion matching method

Country Status (1)

Country Link
CN (1) CN115240079A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117078982A (en) * 2023-10-16 2023-11-17 山东建筑大学 Deep learning-based large-dip-angle stereoscopic image alignment dense feature matching method
CN117078982B (en) * 2023-10-16 2024-01-26 山东建筑大学 Deep learning-based large-dip-angle stereoscopic image alignment dense feature matching method
CN117422746A (en) * 2023-10-23 2024-01-19 武汉珈和科技有限公司 Partition nonlinear geographic registration method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114782691A (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN115240079A (en) Multi-source remote sensing image depth feature fusion matching method
CN111652273B (en) Deep learning-based RGB-D image classification method
CN112907602A (en) Three-dimensional scene point cloud segmentation method based on improved K-nearest neighbor algorithm
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN114758337A (en) Semantic instance reconstruction method, device, equipment and medium
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN116385707A (en) Deep learning scene recognition method based on multi-scale features and feature enhancement
CN117830788B (en) Image target detection method for multi-source information fusion
CN116310098A (en) Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network
CN115880553A (en) Multi-scale change target retrieval method based on space-time modeling
CN114782503A (en) Point cloud registration method and system based on multi-scale feature similarity constraint
CN110956601A (en) Infrared image fusion method and device based on multi-sensor mode coefficients and computer readable storage medium
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN112686830B (en) Super-resolution method of single depth map based on image decomposition
CN111597367A (en) Three-dimensional model retrieval method based on view and Hash algorithm
CN113920587B (en) Human body posture estimation method based on convolutional neural network
CN113628111B (en) Hyperspectral image super-resolution method based on gradient information constraint
CN115496788A (en) Deep completion method using airspace propagation post-processing module
CN115631513A (en) Multi-scale pedestrian re-identification method based on Transformer
CN114708315A (en) Point cloud registration method and system based on depth virtual corresponding point generation
CN114972937A (en) Feature point detection and descriptor generation method based on deep learning
CN117095033B (en) Multi-mode point cloud registration method based on image and geometric information guidance
CN116091787B (en) Small sample target detection method based on feature filtering and feature alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination