CN115240079A - Multi-source remote sensing image depth feature fusion matching method - Google Patents

Multi-source remote sensing image depth feature fusion matching method

Info

Publication number
CN115240079A
CN115240079A (application CN202210792899.XA)
Authority
CN
China
Prior art keywords
feature
image
matching
matched
level
Prior art date
Legal status
Pending
Application number
CN202210792899.XA
Other languages
Chinese (zh)
Inventor
蓝朝桢
王龙号
施群山
周杨
张衡
李鹏程
吕亮
胡校飞
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Priority to CN202210792899.XA
Publication of CN115240079A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of image feature matching, and particularly relates to a multi-source remote sensing image depth feature fusion matching method. Through feature extraction, the method obtains fine fusion features that carry both high-level and low-level information, and performs feature transformation on the high-level features, which raises the similarity between matching points in the high-level features of the images to be matched and makes the matching result more reliable. In the later matching stage, the high-level features are compared for coarse matching to obtain a global matching result, which is then corrected with the fine fusion features, so that the matching result is reliable both globally and in precision, with higher resolution and more accurate positioning. Before feature transformation, the high-level features are given sinusoidal position codes so that a unique correspondence exists among the features, avoiding mismatches caused by overly high similarity among features in sparse texture regions; during coarse matching, a sliding window adaptively lowers the score threshold, which increases the number of matching points screened out in sparse regions and further improves the matching effect there.

Description

Multi-source remote sensing image depth feature fusion matching method
Technical Field
The invention belongs to the field of image feature matching, and particularly relates to a multi-source remote sensing image depth feature fusion matching method.
Background
Multi-source remote sensing images of the same area usually record different characteristics of the same ground features, so combining them can display the ground feature information more completely. The premise of jointly processing multi-source remote sensing images to mine this information is to match the multi-source images, that is, to identify the homonymous (corresponding) points of two or more multi-source images.
Current multi-source remote sensing image matching methods usually apply convolutional neural networks to match multi-source images, for example the CMM-Net and LoFTR algorithms. The CMM-Net algorithm extracts a high-dimensional feature map of the multi-source images with a convolutional neural network, selects feature points on the principle that the channel maximum and the local channel maximum are satisfied simultaneously, and finally completes multi-source image matching. The LoFTR algorithm strengthens feature correlation through a Transformer and obtains a dense matching result through coarse and fine matching; its matching effect in sparse texture regions is good, but its feature extraction network is limited to a single image scale and therefore cannot meet matching requirements in multiple respects.
Existing multi-source image feature matching methods therefore still suffer from defects such as features that are difficult to characterize, large differences in feature vector similarity, and matching difficulties caused by the small number of features in sparse texture regions of multi-source remote sensing images, all of which affect the accuracy of the final matching result. A remote sensing feature matching scheme that can solve these problems is therefore needed.
Disclosure of Invention
The invention aims to provide a multi-source remote sensing image depth feature fusion matching method, to solve the problems of existing multi-source image feature matching methods, namely that the accuracy of the final matching result is affected by features that are difficult to characterize, large differences in feature vector similarity, and matching difficulties caused by the small number of features in sparse texture regions of multi-source remote sensing images.
In order to achieve the purpose, the invention provides a depth feature fusion matching method for a multi-source remote sensing image, which comprises the following steps:
1) Constructing a matching model, wherein the matching model comprises a feature extraction network, a feature transformation module, a dense matching module and a calibration optimization module;
2) Inputting the acquired remote sensing image pair to be matched into the feature extraction network, which extracts features from each image to be matched separately to obtain, for each image, its high-level features and fine fusion features that combine high-level and low-level information (fine positioning information and global information);
3) Inputting the high-level features of the image pair to be matched into the feature transformation module simultaneously; for each image to be matched, fusing its own neighborhood information with the high-level features of the other image to perform feature transformation and fusion, obtaining correlated high-level fusion features for each image of the pair;
4) Densely matching all feature vectors of the high-level fusion features of the image pair to be matched, and obtaining a coarse matching result according to the similarity between the feature vectors;
5) Mapping the coarse matching result onto the fine fusion features to calibrate and optimize the dense matching, obtaining a fine matching result.
The method has the following beneficial effects. During feature extraction, the fine fusion features obtained through extraction and fusion carry high-level and low-level information at the same time, which ensures positioning accuracy, global context, and anti-interference capability; the extracted high-level features then undergo feature transformation and fusion to obtain high-level fusion features, which raises the similarity between matching points in the high-level features of the images to be matched and makes the matching result more reliable. In the later matching stage, the high-level fusion features are compared for coarse matching to obtain a result that better conforms to the global characteristics, and the fine fusion features are then compared to correct that result, so that the final matching result is reliable both globally and in precision, with higher resolution and more accurate positioning.
Further, the feature extraction network comprises three down-sampling layers and two up-sampling layers;
the first down-sampling layer is used for down-sampling the input image to be matched to obtain a high-level feature map with the original dimension of 1/2 of the image to be matched; the second down-sampling layer is used for down-sampling the input high-level feature map of the original dimensionality 1/2 of the image to be matched to obtain a high-level feature map of the original dimensionality 1/4 of the image to be matched; the third down-sampling layer is used for down-sampling the input high-level feature map of the original dimension 1/4 of the image to be matched to obtain the high-level feature map of the original dimension 1/8 of the image to be matched;
the first up-sampling layer is used for up-sampling an input high-level feature map with original dimension 1/8 of an image to be matched into a low-level feature map with original dimension 1/4 of the image to be matched, meanwhile, performing convolution processing on the high-level feature map with original dimension 1/4, and then fusing the low-level feature map with original dimension 1/4 and the high-level feature map with original dimension 1/4 to obtain a fused feature map with original dimension 1/4; the second upsampling layer is used for upsampling the input fusion feature map with the original dimension of 1/4 into a low-level feature map with the original dimension of 1/2, simultaneously performing convolution processing on the high-level feature map with the original dimension of 1/2, and then fusing the low-level feature map with the original dimension of 1/2 and the high-level feature map with the original dimension of 1/2 to obtain a fine fusion feature map with the original dimension of 1/2;
and outputting a high-level feature map of the original dimension 1/8 and a fine fusion feature map of the original dimension 1/2 of each image to be matched for fusion feature matching through the three down-sampling layers and the two up-sampling layers. In order to improve the information globality, the positioning accuracy and the anti-interference capability of the multi-source remote sensing image characteristics, when the characteristics are extracted, the high-level characteristics and the low-level characteristics are fused twice, so that the finally obtained fine fusion characteristic diagram comprises the information of the characteristics of each layer, and the richer the contained characteristic information is, the higher the information globality and the positioning accuracy of the characteristics are.
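The following PyTorch-style sketch illustrates one way such a three-down/two-up extraction network could be organized; it is not the patented implementation, and the class names, the use of stride-2 convolutions, 1x1 lateral convolutions, bilinear up-sampling, and element-wise addition as the fusion operation are all assumptions (the channel widths 128/196/256 follow the embodiment described later).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlock(nn.Module):
    """Stride-2 convolutional down-sampling stage (a plain placeholder for the
    residual convolution block described elsewhere in the patent)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.conv(x)

class FeatureExtractor(nn.Module):
    """Three down-sampling stages (1/2, 1/4, 1/8) and two up-sampling/fusion stages;
    returns the 1/8-resolution high-level map and the 1/2-resolution fine fusion map."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(1, 1, 3, padding=1)   # H x W x 1 input feature image
        self.down1 = DownBlock(1, 128)              # 1/2 resolution, 128-d
        self.down2 = DownBlock(128, 196)            # 1/4 resolution, 196-d
        self.down3 = DownBlock(196, 256)            # 1/8 resolution, 256-d
        self.lat2 = nn.Conv2d(196, 196, 1)          # 1x1 lateral conv on the 1/4 map
        self.lat1 = nn.Conv2d(128, 128, 1)          # 1x1 lateral conv on the 1/2 map
        self.up3to2 = nn.Conv2d(256, 196, 1)        # channel reduction before fusing at 1/4
        self.up2to1 = nn.Conv2d(196, 128, 1)        # channel reduction before fusing at 1/2

    def forward(self, img):                         # img: B x 1 x H x W
        f1 = self.down1(self.stem(img))             # B x 128 x H/2 x W/2
        f2 = self.down2(f1)                         # B x 196 x H/4 x W/4
        f3 = self.down3(f2)                         # B x 256 x H/8 x W/8 (high-level output)
        u2 = F.interpolate(self.up3to2(f3), scale_factor=2, mode="bilinear", align_corners=False)
        fuse2 = self.lat2(f2) + u2                  # 1/4 fusion: lateral map + up-sampled high-level map
        u1 = F.interpolate(self.up2to1(fuse2), scale_factor=2, mode="bilinear", align_corners=False)
        fuse1 = self.lat1(f1) + u1                  # 1/2 fine fusion map
        return f3, fuse1
```

Calling the module on a B x 1 x H x W tensor would return the two maps described above; whether the fusion is element-wise addition or channel concatenation is not fixed by the text, so addition is used here only as one plausible choice.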
Further, the fusion feature matching of the feature transformation module specifically comprises the following steps:
1) Adding position information to the 1/8-dimension high-level feature map of each image to be matched, so that each feature corresponds uniquely to its position on the original image;
2) Flattening the position-encoded high-level feature map of each image to be matched into a one-dimensional vector, inputting it to the feature transformation module, performing several rounds of interleaved processing through the attention layers, and outputting correlated high-level fusion features for each image to be matched.
The attention layers comprise a self-attention layer, a cross-attention layer, and an attention layer. The self-attention layer fuses the position-encoded features of the reference image in the input image pair with its own local neighborhood information to generate a new feature map; the cross-attention layer fuses the position-encoded features of the reference image with the features of the other image in the pair to be matched; the attention layer selects relevant information by measuring the similarity between the query vector and each key feature, and the selected result, after normalization, is superimposed on the flattened one-dimensional vector of each image's position-encoded high-level feature map to obtain fusion features that combine position information, the image's own neighborhood information, and information from the image to be matched. Interleaved processing means feeding the obtained fusion features back into the attention layers for further interleaving; this process is repeated, and the high-level fusion features of the images to be matched are finally output.
Because the correlation between homonymous features of multi-source remote sensing images is often low, the similarity between matching points during matching is also low, making the matching points hard to identify accurately. Improving the correlation of the high-level features of the images to be matched through feature transformation therefore improves the accuracy of matching point identification and yields a more reliable matching result.
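As an illustration of the interleaving described above, the sketch below alternates self- and cross-attention over the two flattened high-level feature sequences. The module name, the use of PyTorch's nn.MultiheadAttention as a stand-in for the linear attention detailed later in the embodiment, the shared weights between the two images, and the residual-plus-LayerNorm arrangement are all assumptions.

```python
import torch
import torch.nn as nn

class FeatureTransformModule(nn.Module):
    """Interleaves self- and cross-attention over the flattened, position-encoded
    1/8-resolution features of the two images to be matched."""
    def __init__(self, dim=256, heads=8, n_interleave=8):
        super().__init__()
        self.n_interleave = n_interleave
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: B x (H/8 * W/8) x dim flattened feature sequences
        for _ in range(self.n_interleave):
            # self-attention: each image fuses its own local neighborhood information
            feat_a = feat_a + self.norm(self.self_attn(feat_a, feat_a, feat_a)[0])
            feat_b = feat_b + self.norm(self.self_attn(feat_b, feat_b, feat_b)[0])
            # cross-attention: each image fuses information from the other image
            a_new = feat_a + self.norm(self.cross_attn(feat_a, feat_b, feat_b)[0])
            b_new = feat_b + self.norm(self.cross_attn(feat_b, feat_a, feat_a)[0])
            feat_a, feat_b = a_new, b_new
        return feat_a, feat_b
```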
Furthermore, in order to reduce parameters and computation while preserving down-sampling precision, all three down-sampling layers adopt a convolution block structure in which the input feature map and the output feature map are superimposed element-wise, and the superimposed feature map is taken as the down-sampling result.
The convolution block structure can simply perform an identity mapping during forward propagation of the neural network: the single-convolution result and the three-convolution result are stacked together to form the down-sampling result. This identity connection adds no extra parameters or computational complexity, improves training efficiency, and preserves sampling precision. Moreover, the structure does not affect back-propagation during training.
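A minimal sketch of such a residual down-sampling block follows, assuming a stride-2 three-convolution branch and a single-convolution shortcut; the exact kernel sizes and normalization layers are not specified in the patent and are illustrative here.

```python
import torch
import torch.nn as nn

class ResidualDownBlock(nn.Module):
    """Down-sampling convolution block with an identity-style skip: the single-convolution
    branch is added element-wise to the three-convolution branch, as described for Fig. 2."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        # single-convolution shortcut so the input and output maps can be superimposed
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return torch.relu(self.branch(x) + self.shortcut(x))
```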
Furthermore, in order to improve the matching effect in sparse texture regions of multi-source images, the position information is added as a sinusoidal code for each pixel feature.
Because sinusoidal coding provides unique position information for every pixel, each feature corresponds uniquely to its pixel position on the original image and features at different levels have a uniquely determined correspondence. This avoids the mismatching problem caused by overly high similarity between feature vectors in sparse texture regions of the image and improves the matching effect in those regions.
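One possible form of the per-pixel sinusoidal position code is sketched below, assuming a DETR/LoFTR-style 2-D layout in which half of the channels encode the x coordinate and half encode the y coordinate; the exact frequency arrangement used by the patent is not given, so this is only an illustration.

```python
import torch

def add_sinusoidal_encoding(feat):
    """Add a 2-D sinusoidal position code to a B x C x H x W feature map so that every
    pixel carries a unique position signature."""
    b, c, h, w = feat.shape
    assert c % 4 == 0, "channel count must be divisible by 4 for this layout"
    pe = torch.zeros(c, h, w, device=feat.device)
    y = torch.arange(h, device=feat.device).unsqueeze(1).expand(h, w).float()
    x = torch.arange(w, device=feat.device).unsqueeze(0).expand(h, w).float()
    # geometric frequency ladder, one (sin, cos) pair per coordinate per frequency
    div = torch.exp(torch.arange(0, c // 2, 2, device=feat.device).float()
                    * (-torch.log(torch.tensor(10000.0)) / (c // 2)))
    for i, d in enumerate(div):
        pe[4 * i]     = torch.sin(x * d)
        pe[4 * i + 1] = torch.cos(x * d)
        pe[4 * i + 2] = torch.sin(y * d)
        pe[4 * i + 3] = torch.cos(y * d)
    return feat + pe.unsqueeze(0)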
Further, the coarse matching represents the similarity between all feature vectors through a score matrix between the high-level fusion feature vectors; if the similarity is greater than a score threshold, the pair is regarded as a correct match.
The score matrix S between the vectors is determined by the following formula, where <·,·> denotes the inner product:
S(i, j) = <F_A_tr(i), F_B_tr(j)>, (i, j) ∈ A×B
where F_A_tr and F_B_tr are the high-level fusion features of the images A and B to be matched, and A×B is the set of all possible pixel correspondences between images A and B.
The score matrix is computed for all possible matches, and the optimal assignment matrix P is obtained by maximizing the total score Σ_i,j S(i,j) P(i,j). The optimal assignment matrix P represents the optimal correspondence between the high-level fusion feature vectors of image A and those of image B.
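The sketch below computes the score matrix S from the two sets of high-level fusion feature vectors and extracts matches with a score threshold plus a mutual-nearest-neighbor check. This greedy selection is only a stand-in for solving the optimal assignment matrix P (the embodiment later mentions the Sinkhorn algorithm for that purpose); the function name, the L2-normalization assumption, and the threshold value are illustrative.

```python
import torch

def coarse_match(feat_a, feat_b, score_thr=0.2):
    """Compute S(i, j) = <F_A_tr(i), F_B_tr(j)> and keep mutually-nearest pairs
    whose score exceeds the threshold."""
    # feat_a: Na x C, feat_b: Nb x C, assumed L2-normalized so the inner product is a similarity
    scores = feat_a @ feat_b.t()                          # Na x Nb score matrix S
    best_b = scores.argmax(dim=1)                         # best column for every row
    best_a = scores.argmax(dim=0)                         # best row for every column
    rows = torch.arange(feat_a.shape[0])
    mutual = best_a[best_b] == rows                       # mutual nearest neighbors
    keep = mutual & (scores[rows, best_b] > score_thr)    # score-threshold filter
    return torch.stack([rows[keep], best_b[keep]], dim=1) # (i, j) index pairs
```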
Further, in order to accurately increase the coarse matching results in sparse texture regions, a sliding-window adaptive score threshold detection algorithm is applied to the coarse matching on the score matrix, with the following steps:
1) Setting an initial score threshold θ, setting the sliding window area and the horizontal and vertical sliding steps, and sliding-detecting the scores of the high-level fusion feature vectors;
2) If the scores s of all high-level fusion feature vectors in the current window are less than θ, computing the adaptive threshold avgθ for this matching-sparse window; traversing the high-level fusion feature vectors in the window, adding every vector whose score s exceeds avgθ to the coarse matching point set, and continuing to slide the window;
3) If any feature vector in the current window has a score s greater than or equal to θ, continuing to slide the window;
4) Repeating these steps until the window has traversed the high-level fusion features of the images to be matched, and outputting the coarse matching point set.
the calculation method of the area of the sliding window, the horizontal sliding step length and the vertical sliding step length is as follows
Figure BDA0003731006350000051
Figure BDA0003731006350000052
Figure BDA0003731006350000053
ws is the area of the sliding window, hl is the horizontal sliding step length, and vl is the vertical sliding step length;
the calculation formula of the adaptive threshold value avg theta in the matching sparse node is as follows
Figure BDA0003731006350000054
Wherein n is the number of the characteristic vectors in the sliding window, s i The matching scores of the feature vectors in the sliding window are obtained.
Because the number of matching point pairs in sparse or uniform texture regions is small during multi-source remote sensing image matching, the number of high-score matching pairs there may be lower than in densely matched regions. Adaptively lowering the score threshold in matching-sparse regions screens in additional lower-score matching pairs and supplements the matching data, while redundant low-score pairs in densely matched regions are not selected, avoiding the introduction of errors.
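A simplified sketch of the sliding-window adaptive threshold idea follows, assuming the per-cell best matching scores are arranged as a 2-D map on image A's 1/8-resolution grid. The window size and steps are fixed placeholders here, whereas the patent derives them with formulas not reproduced above, and ws is used as a side length rather than an area purely for simplicity.

```python
import numpy as np

def adaptive_threshold_select(score_map, theta=0.2, ws=16, hl=8, vl=8):
    """Sliding-window adaptive score threshold: in windows where every score is below
    the global threshold, admit matches above the window's mean score instead."""
    h, w = score_map.shape
    selected = score_map >= theta                      # matches kept by the global threshold
    for top in range(0, max(h - ws, 0) + 1, vl):
        for left in range(0, max(w - ws, 0) + 1, hl):
            win = score_map[top:top + ws, left:left + ws]
            if (win < theta).all():                    # matching-sparse window
                avg_theta = win.mean()                 # adaptive threshold: mean score in window
                extra = win > avg_theta                # locally strong but globally weak matches
                selected[top:top + ws, left:left + ws] |= extra
    return selected

# usage sketch with random placeholder scores
scores = np.random.rand(60, 80) * 0.3
mask = adaptive_threshold_select(scores, theta=0.25)
```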
Further, in order to improve the matching precision, the process of mapping the coarse matching result to the low-level features to perform calibration optimization on the dense matching is as follows:
1) Taking the N pairs of feature points on the correlated high-level fusion features of each image to be matched as centers, where the N pairs of feature points are the coarse matching point set screened out by coarse matching, and cropping N pairs of local windows of size m × m on the corresponding high-level fusion features;
2) Mapping the N pairs of windows onto the fine fusion features of the images to be matched to obtain N pairs of local fine windows centered on the coarse matching feature point pairs, inputting the N pairs of local fine windows into the feature transformation module, and transforming the windows several times to generate N pairs of local fine fusion feature maps of images A and B centered on the coarse matching feature point pairs;
3) Correlating the feature vector corresponding to the center point P of each local fine fusion feature map of image A with all vectors of the corresponding local fine fusion feature map of image B, generating the expected value of the matching probability distribution, with respect to P, of each pixel in the image-B window. The expected value of the matching probability distribution is calculated as
E(P) = Σ_x softmax(<V_A(P), V_B(x)>) · y
where V_A(P) is the feature vector of the center point P in the image-A window, V_B(x) is the feature vector of a pixel x in the image-B window, and y is the pixel coordinate of pixel x on image B. The pixel with the highest computed probability is the fine matching result on image B, with sub-pixel precision, of the point P on image A, and this result is taken as the final matching result.
Because the coarse matching result is obtained on the high-level fusion features, and the high-level feature description may contain matching errors for multi-source remote sensing images with large differences, the coarse matching result is relocated onto the fine fusion features and the best-matching fine feature points are selected, yielding a fine matching result for the multi-source remote sensing images with higher resolution and more precise positioning.
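The refinement of a single coarse match can be sketched as follows, assuming m × m local fine windows and taking the expectation of the softmax-normalized correlation map as the sub-pixel position; function and variable names are illustrative, not the patent's.

```python
import torch

def refine_match(win_a, win_b, center=None):
    """Correlate the center vector of the image-A fine window with every vector of the
    image-B fine window, form a probability map with a softmax, and take its expectation
    to obtain a sub-pixel position inside the image-B window."""
    # win_a, win_b: C x m x m local fine fusion feature windows
    c, m, _ = win_a.shape
    if center is None:
        center = (m // 2, m // 2)
    v_p = win_a[:, center[0], center[1]]               # feature vector of the center point P
    heat = win_b.reshape(c, -1).t() @ v_p              # correlation with every pixel of window B
    prob = torch.softmax(heat, dim=0).reshape(m, m)    # matching probability distribution
    ys, xs = torch.meshgrid(torch.arange(m, dtype=torch.float32),
                            torch.arange(m, dtype=torch.float32), indexing="ij")
    # expectation over pixel coordinates gives the sub-pixel offset inside the window
    y_hat = (prob * ys).sum()
    x_hat = (prob * xs).sum()
    return x_hat, y_hat, prob
```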
Furthermore, in order to guarantee the accuracy of the matching result, after the fine matching result is obtained, the PROSAC algorithm is used to check the fine matching result again and eliminate mismatches.
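For the PROSAC purification, a sketch using OpenCV's USAC_PROSAC robust estimator (available from OpenCV 4.5 onward) is shown below. Whether the patent fits a homography or an affine model as the geometric constraint is an assumption, and PROSAC expects the correspondences to be ordered by decreasing matching score.

```python
import cv2
import numpy as np

# Mismatch rejection sketch: fit a homography with PROSAC-style robust sampling.
# pts_a / pts_b stand in for the fine matching results, sorted by descending match score.
pts_a = np.random.rand(200, 2).astype(np.float32) * 512          # placeholder coordinates
pts_b = pts_a + np.random.randn(200, 2).astype(np.float32)       # placeholder coordinates

H, inlier_mask = cv2.findHomography(pts_a, pts_b, cv2.USAC_PROSAC, ransacReprojThreshold=3.0)
kept_a = pts_a[inlier_mask.ravel() == 1]                         # purified matches in image A
kept_b = pts_b[inlier_mask.ravel() == 1]                         # purified matches in image B
```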
Drawings
FIG. 1 is a schematic diagram of a feature extraction network according to embodiment 1 of the present invention;
FIG. 2 is a diagram illustrating a convolution block structure according to embodiment 1 of the present invention;
FIG. 3 is a schematic view of a characteristic connection structure of embodiment 1 of the method of the present invention;
FIG. 4a is a visualization of the H/8 × W/8 × 256 high-level feature map of embodiment 1 of the method of the present invention, and FIG. 4b is a visualization of the H/2 × W/2 × 128 fine fusion feature map of embodiment 1 of the method of the present invention;
FIG. 5 is a flowchart of a feature transformation process according to embodiment 1 of the present invention;
fig. 6a is the unmanned aerial vehicle optical image of the first image pair of a comparative example of the present invention, fig. 6b is the unmanned aerial vehicle thermal infrared image of the first image pair, fig. 6c is the ZY-3 panchromatic image of the second image pair, fig. 6d is the GF-3 SAR image of the second image pair, fig. 6e is the summer Google image of the third image pair, fig. 6f is the winter Google image of the third image pair, fig. 6g is the Google optical image of the fourth image pair, fig. 6h is the ZY-3 panchromatic image of the fourth image pair, fig. 6i is the Google optical image of the fifth image pair, fig. 6j is the GF-2 panchromatic image of the fifth image pair, fig. 6k is the Google optical image of the sixth image pair, and fig. 6l is the OSM raster map image of the sixth image pair of a comparative example of the present invention;
FIG. 7 is a schematic diagram of five-direction division according to a comparative example of the present invention;
fig. 8a is a schematic diagram of a fine matching result of a first group of image pairs according to a comparative example of the present invention, fig. 8b is a schematic diagram of a fine matching result of a second group of image pairs according to a comparative example of the present invention, fig. 8c is a schematic diagram of a fine matching result of a third group of image pairs according to a comparative example of the present invention, fig. 8d is a schematic diagram of a fine matching result of a fourth group of image pairs according to a comparative example of the present invention, fig. 8e is a schematic diagram of a fine matching result of a fifth group of image pairs according to a comparative example of the present invention, and fig. 8f is a schematic diagram of a fine matching result of a sixth group of image pairs according to a comparative example of the present invention;
fig. 9a is a schematic diagram of a purification result of a first group of image pairs according to a comparative example of the present invention, fig. 9b is a schematic diagram of a purification matching result of a second group of image pairs according to a comparative example of the present invention, fig. 9c is a schematic diagram of a purification result of a third group of image pairs according to a comparative example of the present invention, fig. 9d is a schematic diagram of a purification result of a fourth group of image pairs according to a comparative example of the present invention, fig. 9e is a schematic diagram of a purification result of a fifth group of image pairs according to a comparative example of the present invention, and fig. 9f is a schematic diagram of a purification result of a sixth group of image pairs according to a comparative example of the present invention;
fig. 10a is a matching effect of a LoFTR algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm, fig. 10b is a matching effect of a SuperPoint algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm, fig. 10c is a matching effect of a SIFT algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm, fig. 10d is a matching effect of a ContextDesc algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm, and fig. 10e is a matching effect of an FFM algorithm of a comparative example of the present invention on a first, third, and fourth sets of multi-source remote sensing image pairs after being purified by a PROSAC algorithm;
fig. 11a is an enlarged view of a final registration result and a partial windowing of a first group of image pairs according to a comparative example of the present invention, fig. 11b is an enlarged view of a final registration result and a partial windowing of a second group of image pairs according to a comparative example of the present invention, fig. 11c is an enlarged view of a final registration result and a partial windowing of a third group of image pairs according to a comparative example of the present invention, fig. 11d is an enlarged view of a final registration result and a partial windowing of a fourth group of image pairs according to a comparative example of the present invention, fig. 11e is an enlarged view of a final registration result and a partial windowing of a fifth group of image pairs according to a comparative example of the present invention, and fig. 11f is an enlarged view of a final registration result and a partial windowing of a sixth group of image pairs according to a comparative example of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Method embodiment
The embodiment provides a multi-source remote sensing image depth feature fusion matching method, which comprises the following specific steps:
1) Constructing a matching model, wherein the matching model comprises a feature extraction network, a feature transformation module, a dense matching module and a calibration optimization module;
2) Inputting the acquired remote sensing image pair to be matched into the feature extraction network, which extracts features from each image to be matched separately to obtain each image's high-level features and fine fusion features that combine high-level and low-level information.
Referring to fig. 1, the structure of the feature extraction network is as follows:
(1) is the image input, which converts an input original image H × W into a feature image H × W × 1;
(2), (3) and (4) are the first, second and third down-sampling layers from the lower to the upper level: (2) extracts the 128-dimensional high-level features at 1/2 of the original image dimension (H/2 × W/2 × 128), (3) further extracts the 196-dimensional high-level features at 1/4 of the original dimension (H/4 × W/4 × 196), and (4) further extracts the 256-dimensional high-level features at 1/8 of the original dimension (H/8 × W/8 × 256);
the down-sampling in (2), (3) and (4) is a convolution operation on the original image. To reduce parameters and computation, (2), (3) and (4) all adopt a convolution block structure that can simply perform an identity mapping during forward propagation of the neural network, stacking the single-convolution result and the three-convolution result together to obtain the down-sampling result; the structure also does not affect back-propagation during training. As shown in fig. 2, the input feature map f1 and the output feature map f2 are superimposed; this simple connection improves the training effect of the model without adding extra parameters or computation to the network.
(5) and (8) are the bilinear interpolation (up-sampling) parts from higher to lower levels, and (6) and (9) are the feature connection parts. (5) produces the 196-dimensional low-level fine features at 1/4 of the original dimension (H/4 × W/4 × 196), and (6) connects and fuses the 196-dimensional 1/4-dimension high-level features extracted in (3) with the 196-dimensional 1/4-dimension low-level fine features obtained in (5), forming (7), the first up-sampling layer, which fuses the features containing bottom-level positioning information with the up-sampled high-level features rich in semantic information to obtain the 196-dimensional fused features at 1/4 of the original dimension. (8) then up-samples to produce 128-dimensional features at 1/2 of the original dimension, and (9) connects and fuses the 128-dimensional 1/2-dimension high-level features extracted in (2) with the 128-dimensional 1/2-dimension features obtained in (8), forming the second up-sampling layer, which fuses the features containing bottom-level positioning information with the up-sampled result of the first fusion again to obtain the 128-dimensional fine fusion features at 1/2 of the original dimension.
The feature extraction network finally outputs a high-level feature map at 1/8 of the original dimension and a fine fusion feature map at 1/2 of the original dimension for fusion feature matching. The final feature extraction result is obtained through down-sampling followed by refined up-sampling, with feature fusion at two different levels, which greatly improves the global context and positioning accuracy of the features.
The structure of the feature connection parts (6) and (9) is shown in FIG. 3. Taking (6) as an example, this part up-samples the high-level feature F3 (H/8 × W/8 × 256), which is richer in semantic information, to obtain F4 (H/4 × W/4 × 196), convolves the preceding-level feature F1 (H/4 × W/4 × 196) of F3 with a 1 × 1 convolution kernel to obtain F2 (H/4 × W/4 × 196), and then connects F4 to F2 to obtain the fused feature F5 (H/4 × W/4 × 196), which has both high positioning accuracy and global context. This connection adaptively adjusts the feature scale so that F2 and F4 accommodate the scale difference, removing the dependence on a single image scale. In addition, the operation fuses the positioning detail of the low-level features with the rich semantic information of the high-level features; the fused features greatly enhance the representation capability of multi-source remote sensing image features and resist the geometric and scale differences between multi-source remote sensing images.
In this embodiment, the images A and B to be matched are input into the feature extraction network, and after feature extraction and fusion the 256-dimensional high-level features F_A_C and F_B_C at 1/8 of the original image dimension and the 128-dimensional fine fusion features F_A_F and F_B_F at 1/2 of the original dimension are finally extracted. The feature map visualizations are shown in fig. 4, where fig. 4a is the H/8 × W/8 × 256 high-level feature map visualization and fig. 4b is the H/2 × W/2 × 128 fine fusion feature map visualization.
2) Inputting the high-level features of the images to be matched into the feature transformation module and performing feature transformation and fusion to obtain correlated high-level fusion features for each image to be matched.
Because the feature differences between multi-source remote sensing images are extremely large, and in particular the differences between homonymous features make their similarity low, the subsequent matching effect is seriously affected and the requirements of precise matching are hard to meet. The feature transformation module of this embodiment therefore processes the 256-dimensional, 1/8-dimension high-level features obtained in step 1) through the attention layers of a local feature transformer to obtain correlated high-level fusion features.
Before the feature transformation, sinusoidal codes are added to the 256-dimensional high-level features F_A_C and F_B_C obtained in step 1) to provide unique position information for every pixel, so that each feature corresponds uniquely to its pixel position on the original image and features at different levels have a uniquely determined correspondence. This allows the high-level features to be mapped onto the lower-level fine fusion features later, avoids the mismatching problem caused by overly high similarity between feature vectors in sparse texture regions, and improves the matching effect in sparse texture regions.
The position-encoded high-level features F_A_C and F_B_C are then flattened into one-dimensional vectors L1 and L2, each of length H/8 × W/8, and L1 and L2 are input into the feature transformation module. The feature transformation module improves the similarity between the high-level features of the two images by interleaving attention layers; the processing flow is shown in fig. 5.
Referring to fig. 5, the input vectors of an attention layer can be likened to the Query, Key and Value vectors of a dictionary lookup; the aim is to transform the features and further fuse the neighborhood information of the image itself and the feature information of the image to be matched. For the self-attention layer (Self Layer), Query = Key = Value = L1, which is equivalent to inputting the position-encoded features to fuse their own local neighborhood information and generate a new feature map; for the cross-attention layer (Cross Layer), Query = L1 and Key = Value = L2, so the position-encoded features of the reference image are fused with the features of the image to be matched. The attention weight W between features is computed through the dot product of Query and Key, and information is retrieved from Value. In this example, elu(Q)+1 and elu(K^T)+1 replace Query and Key to reduce the computation cost; furthermore, using the associativity of matrix multiplication, (elu(K^T)+1)·V is computed first to simplify the calculation. The final formula is as follows:
W(Q, K, V) = (elu(Q)+1) · [(elu(K^T)+1) · V]
The attention layer selects relevant information by measuring the similarity between the query vector and each key feature; the output vector is the sum of the value vectors weighted by the similarity scores, so relevant information can be extracted from value vectors with high similarity. In the attention layer, the linear attention result is normalized and then superimposed on the position-encoded L1 and L2 to obtain one-dimensional features F_A_tr and F_B_tr that fuse position information, neighborhood information, and information from the image to be matched. Because the normalization in the attention layer may lose information, the obtained fusion features F_A_tr and F_B_tr are fed back as L1 and L2 and processed again. Because the texture and gray-level differences between multi-source remote sensing images are large, the two sets of high-level features need a stronger correlation, so this embodiment increases the number of interleaving rounds to obtain more reliable, more strongly correlated fusion features; the value of N can be set according to the matching accuracy requirement, and in a preferred embodiment the interleaving number N = 8 gives the highest subsequent matching accuracy. Through this processing, the high-level fusion features F_A_tr and F_B_tr, which fuse the image's own neighborhood information and the information of the image to be matched, are output; after feature transformation, the correlation between the high-level fusion features of images A and B is stronger and the correlation between non-homonymous features is weaker, so the similarity of homonymous features is higher than that of other features and mismatches are reduced.
3) Carrying out dense matching on all feature vectors on the high-level fusion features, and obtaining a coarse matching result according to the similarity between the feature vectors;
in the step 2), the two high-level fusion features have stronger correlation after feature transformation fusion, and dense matching corresponding to the pixel level of the high-level fusion features is established on the high-level fusion features to obtain a rough dense matching result. In the embodiment, the initial matching, namely the rough matching, among the multi-source remote sensing images is performed through the optimal matching layer, and the key point of the rough matching is the similarity contrast among the features.
Because the output vector of the feature transformation fusion device is the sum of the value vectors weighted by the similarity scores and carries the similarity information, the similarity between all the feature vectors can be represented by a score matrix between the vectors, and if the similarity is greater than a score threshold, the similarity between the two vectors is high, so that the two vectors can be regarded as correct matching.
In this embodiment, the score matrix S between the high-level fusion features F_A_tr and F_B_tr is determined by
S(i, j) = <F_A_tr(i), F_B_tr(j)>, (i, j) ∈ A×B
where <·,·> denotes the inner product and A×B is the set of all possible pixel correspondences between the images A and B to be matched; for example, if image A has 640 × 480 pixels and image B has 500 × 300 pixels, there are 640 × 480 × 500 × 300 possible correspondences, i is one of the 640 × 480 pixels of image A, and j is one of the 500 × 300 pixels of image B.
The score matrix is computed for all possible matches, that is, for all possible pixel correspondences between the images A and B to be matched, and the optimal assignment matrix P is obtained by maximizing the total score Σ_i,j S(i,j) P(i,j). The optimal assignment matrix P represents the optimal correspondence between the high-level fusion feature vectors of image A and those of image B.
In this embodiment, the optimal assignment matrix P can be calculated with the entropy-regularization formulation of the optimal transport algorithm, and the optimal transport problem can be solved efficiently for P with the Sinkhorn algorithm. Finally, the mutual nearest neighbor criterion (MNN) is applied to filter potentially abnormal matches; combining the two selection criteria yields a relatively reliable and evenly distributed matching result.
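A sketch of the Sinkhorn-based assignment followed by mutual-nearest-neighbor filtering is shown below, assuming a log-domain implementation without the extra "dustbin" row and column some matchers add for unmatched points; the iteration count and regularization strength are illustrative.

```python
import torch

def sinkhorn_assignment(scores, n_iters=20, eps=0.1):
    """Entropy-regularized, optimal-transport style normalization of the score matrix via
    Sinkhorn iterations, then mutual-nearest-neighbor (MNN) filtering of the result."""
    log_p = scores / eps                                               # log-domain potentials
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)    # row normalization
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)    # column normalization
    p = log_p.exp()                                                    # (approximately) doubly stochastic assignment
    row_best = p.argmax(dim=1)
    col_best = p.argmax(dim=0)
    rows = torch.arange(p.shape[0])
    mutual = col_best[row_best] == rows                                # MNN criterion
    return p, torch.stack([rows[mutual], row_best[mutual]], dim=1)
```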
Because the number of matching point pairs in sparse or uniform texture regions is small during multi-source remote sensing image matching, not enough high-score matching pairs can be selected there, so the features of those regions participate insufficiently in matching and an ideal matching effect is hard to achieve. A sliding-window adaptive score threshold detection algorithm is therefore established to increase the coarse matching results in sparse texture regions, as follows:
First, an initial score threshold θ is set, together with the sliding window area and the horizontal and vertical sliding steps, and the scores of the correlated high-level fusion feature vectors are detected as the window slides. If any feature vector in the current window has a score s greater than or equal to θ, the window simply continues to slide.
When the scores s of all high-level fusion feature vectors in the current window are less than θ, the adaptive threshold avgθ of this matching-sparse window is computed; the high-level fusion feature vectors in the window are traversed, every vector whose score s exceeds avgθ is added to the coarse matching point set, and the window then continues to slide.
These operations are repeated until the window has traversed the high-level fusion features F_A_tr and F_B_tr, and the coarse matching point set of F_A_tr and F_B_tr is output.
Wherein, the calculation formulas of the area of the sliding window, the horizontal sliding step length and the vertical sliding step length are as follows
Figure BDA0003731006350000122
Figure BDA0003731006350000131
Figure BDA0003731006350000132
Wherein ws is the area of the sliding window, hl is the horizontal sliding step length, and vl is the vertical sliding step length;
The adaptive threshold avgθ of a matching-sparse window is calculated as
avgθ = (1/n) Σ_{i=1..n} s_i
where n is the number of feature vectors in the sliding window and s_i is the matching score of the i-th feature vector in the window.
Because the number of matching point pairs in sparse or uniform texture regions is small during multi-source remote sensing image matching, the number of high-score matching pairs there may be lower than in densely matched regions; adaptively lowering the score threshold in matching-sparse regions screens in additional lower-score matching pairs and supplements the matching data. At the same time, because the threshold is lowered adaptively according to the feature sparsity of the region, redundant low-score pairs in densely matched regions are not selected and errors are avoided.
In this embodiment, the outdoor scenes of the MegaDepth dataset are also used as the training set for the coarse matching result. MegaDepth is a large depth dataset for monocular depth estimation generated from a large number of internet pictures; it contains about one hundred thousand outdoor three-dimensional scenes, from which stereo pairs with strict transformation relations and their camera parameters can be generated, and the image points of the stereo pairs have a one-to-one pixel matching relation. The true matches of the actual scenes in the training set are computed through this correspondence and taken as the ground-truth assignment matrix; combining it with the assignment matrix value that represents the coarse matching result and minimizing the difference between the two yields the best coarse matches, which finally improves the stability and reliability of coarse matching. According to this principle, the loss function measures the difference between the computed assignment matrix value and the ground-truth assignment matrix. The purpose of training is to make the coarse matching result continuously approach the known true matches, mainly training on data with ground-truth matching information such as illumination changes, large scale differences, and day-night image pairs, and learning the true matching relation.
4) Mapping the coarse matching result onto the fine fusion features to calibrate and optimize the dense matching and obtain the fine matching result.
Because the coarse matching result is obtained at 1/8 of the original image resolution, its position may be in error when mapped back to the original dimension: two high-level descriptors may be extremely similar without being the most similar, and an error of several pixels can exist within a local window. For example, a coarse match obtained at the high-level feature resolution corresponds to matching between feature vectors extracted from 8 × 8 pixel regions and cannot be located precisely at the pixel level, and the high-level feature description may itself be in error for multi-source remote sensing images with large differences. Therefore, the feature points obtained by coarse matching in step 3) are located in the fine fusion features obtained in step 1) for calibration and optimization, realizing further fine matching and producing a higher-resolution fine matching result for the multi-source remote sensing images.
Specifically, during fine matching, taking the N feature point pairs of the coarse matching point set screened in step 3) as centers, N pairs of local windows of size m × m are cropped on the high-level fusion features F_A_tr and F_B_tr respectively; the N pairs of local windows are mapped correspondingly onto the fine fusion features F_A_F and F_B_F of the images A and B to be matched to obtain N pairs of local fine windows centered on the coarse matching feature point pairs; the N pairs of local fine windows are input into the feature transformation module and transformed several times to generate N pairs of local fine fusion feature maps of images A and B centered on the coarse matching feature point pairs.
The value of m can be set according to actual needs: if running time matters most, m is set to about 5; if the final precision of the matching result matters most, m is set to about 8; if the hardware is limited, m is set to about 3 so that the memory footprint is small. In this embodiment m is set to 5.
Then the feature vector corresponding to the center point P of each local fine fusion feature map of image A is correlated with all vectors of the corresponding local fine fusion feature map of image B, generating the expected value of the matching probability distribution, with respect to P, of each pixel in the image-B window. The expected value of the matching probability distribution is calculated as
E(P) = Σ_x softmax(<V_A(P), V_B(x)>) · y
where V_A(P) is the feature vector of the center point P in the image-A window, V_B(x) is the feature vector of a pixel x in the image-B window, and y is the pixel coordinate of pixel x on image B. The pixel with the highest computed probability is the fine matching result on image B, with sub-pixel precision, of the point P on image A, and this result is taken as the final matching result.
To guarantee the accuracy of the matching result, the fine matching result must again be checked for mismatches and purified. This method uses the progressive sample consensus algorithm, i.e. the PROSAC algorithm, to eliminate mismatches. PROSAC samples from a progressively enlarged set of best matching point pairs; although the algorithm can become very unstable when affected by too many mismatched points, the combined processing of coarse matching and fine matching calibration leaves very few mismatches, so using PROSAC to eliminate the remaining mismatches is well suited here.
Comparative example
In this comparative example, the multi-source remote sensing image depth feature fusion matching method of the method embodiment (hereinafter referred to as the FFM algorithm) is tested under the Ubuntu 18.04 operating system; the programming language environment is Python 3.6 and the programming platform is PyCharm. The hardware platform is a notebook computer with an i7 CPU, 31 GB of memory, and a GeForce RTX 2060 graphics card (6 GB of video memory).
Six pairs of multi-source remote sensing images are selected for testing in this comparative example: the first group is an unmanned aerial vehicle optical image and an unmanned aerial vehicle thermal infrared image, the second group is a ZY-3 panchromatic image and a GF-3 SAR image, the third group is a summer Google image and a winter Google image, the fourth group is a Google optical image and a ZY-3 panchromatic image, the fifth group is a Google optical image and a GF-2 panchromatic image, and the sixth group is a Google optical image and an OSM raster map image. The specific images are shown in figs. 6a-6l, where fig. 6a is the unmanned aerial vehicle optical image of the first group, fig. 6b is the unmanned aerial vehicle thermal infrared image of the first group, fig. 6c is the ZY-3 panchromatic image of the second group, fig. 6d is the GF-3 SAR image of the second group, fig. 6e is the summer Google image of the third group, fig. 6f is the winter Google image of the third group, fig. 6g is the Google optical image of the fourth group, fig. 6h is the ZY-3 panchromatic image of the fourth group, fig. 6i is the Google optical image of the fifth group, fig. 6j is the GF-2 panchromatic image of the fifth group, fig. 6k is the Google optical image of the sixth group, and fig. 6l is the OSM raster map image of the sixth group.
The comparative analysis of six groups of multi-source remote sensing image data is shown in table 1:
TABLE 1 comparative analysis of test data
Figure BDA0003731006350000151
Figure BDA0003731006350000161
The performance of the matching algorithms is evaluated with the number of correct matching points (P), the matching accuracy (MA), the root mean square error (RMSE) of the matching points, and the matching time (t). Since the matching algorithm of this comparative example focuses on obtaining a more uniform matching result, the uniformity of the distribution of the matching points (RSD) is also used to measure how evenly the matching results are distributed.
A matching point is counted as correct if the transformed position of the feature point on the image to be matched differs from the position of its corresponding point on the reference image by less than a threshold, which is checked with the following verification formula:

$\left\| H(x_i, y_i) - (x_i', y_i') \right\| < \varepsilon$

where H is an affine transformation model fitted from manually selected points, used in place of the true transformation between the two multi-source remote sensing images; if the distance between the feature point $(x_i, y_i)$ after affine transformation and its corresponding (same-name) point $(x_i', y_i')$ is less than the threshold $\varepsilon$, the pair is judged to be a correct match. In this comparative example the threshold is set to 3. The number of correct matching points (P) is the number of matching points satisfying the above condition; this index reflects the basic performance of a feature matching algorithm.
The matching accuracy (MA) is the ratio of the number of correct matching points to the total number of matching points; this index reflects how reliably the algorithm produces successful matches.
The root mean square error (RMSE) of the matching points is the square root of the mean squared difference between the affine-transformed positions $x_i'$ of the correct matching points and their true positions $x_i$, computed over the n correct matches:

$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left\| x_i' - x_i \right\|^{2}}$
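For illustration, the number of correct matches P, the matching accuracy MA and the matching-point RMSE can be computed along the following lines; this is a minimal sketch assuming the matched points and the manually fitted 2×3 affine model H are already available as NumPy arrays, and the function and variable names are illustrative only:

```python
import numpy as np

def evaluate_matches(pts_a, pts_b, H, eps=3.0):
    """Evaluate matches between an image to be matched (A) and a reference image (B).

    pts_a, pts_b: (N, 2) matched pixel coordinates on A and B.
    H: 2x3 affine matrix fitted from manually selected control points.
    eps: distance threshold in pixels (3 in the comparative example).
    """
    ones = np.ones((len(pts_a), 1))
    projected = np.hstack([pts_a, ones]) @ H.T          # affine-transform the points of A
    errors = np.linalg.norm(projected - pts_b, axis=1)  # distance to the same-name points
    correct = errors < eps

    P = int(correct.sum())                              # number of correct matching points
    MA = P / len(pts_a) if len(pts_a) else 0.0          # matching accuracy
    RMSE = float(np.sqrt(np.mean(errors[correct] ** 2))) if P else float("inf")
    return P, MA, RMSE
```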
The distribution uniformity of the matching points is calculated from the uniformity of the matching results along five directions; the image is divided into ten regions in total along these five directions, as shown in fig. 7. Matching-point error estimation is an important index for measuring the matching effect: the root mean square error responds strongly to unusually large or small errors within a set of transformations, so it reflects the accuracy of multi-source remote sensing image matching results well. Moreover, the reference values used for the matching-point RMSE are true pixel coordinates, which carry no bias, making the RMSE particularly suitable for estimating matching-point error.
Following statistical principles, the sample variance is used to represent the difference in the number of matching points among the image blocks in the five directions: if the matching point pairs are distributed relatively uniformly across the five directions, the sample variance of their counts is small; otherwise it is large. The distribution uniformity of the matching points is given by:
(The distribution-uniformity formula is provided as an image in the original publication; it is a logarithmic function of the sample variance of the region statistical distribution vector V defined below.)
In the formula, V is the region statistical distribution vector, formed by concatenating the numbers of matching points in the ten regions. A larger distribution-uniformity value indicates a more uniform distribution of the matching points; a smaller value indicates a less uniform distribution.
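Since the exact uniformity formula appears only as an image in the original publication, the following sketch merely assumes a plausible form consistent with the description, i.e. a logarithmic function of the sample variance of the ten region counts; both the assumed formula and the input convention are illustrative assumptions:

```python
import numpy as np

def distribution_uniformity(region_counts):
    """Assumed uniformity score: larger when the ten region counts are more even.

    region_counts: length-10 sequence with the number of matching points falling
    into each of the ten regions (two regions per direction, five directions).
    """
    V = np.asarray(region_counts, dtype=float)
    var = V.var(ddof=1)                        # sample variance of the region counts
    return float(np.log(1.0 / (var + 1e-6)))  # assumed log-of-inverse-variance form
```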
Several representative algorithms applicable to multi-source remote sensing image matching are selected for comparison and analysis: the deep-learning-based SuperPoint, ContextDesc and LoFTR algorithms, and the classic SIFT algorithm. SuperPoint is a self-supervised deep learning algorithm for extracting feature points and descriptors. ContextDesc is a deep learning matching algorithm designed specifically for multi-modal images; it enhances an original feature descriptor such as DELF with high-level visual information of the image and geometric information of the keypoint distribution. SIFT, the scale-invariant feature transform, is a local feature descriptor with a degree of affine invariance and interference resistance. These four algorithms and the FFM algorithm of this comparative example are applied to the above six image pairs for matching tests; the matching test results are compared in Table 2.
TABLE 2 comparison of matching test results
(The body of Table 2 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 2, the FFM algorithm achieves good matching results on all six pairs of multi-source remote sensing images and obtains a sufficient number of correct matching points in a competitive time.
Comparison within Table 2 shows that, for multi-source remote sensing image pairs of different modalities, the FFM algorithm obtains a larger number of correct matching points; it is ahead of or behind the LoFTR algorithm depending on the image pair, but its counts are far higher than those of the other three algorithms. Because multi-source remote sensing images exhibit large gray-level differences and inconsistent local gradient information at keypoints, the SIFT algorithm fails to match the visible-light image with the thermal infrared image, the panchromatic image with the SAR image, and the optical image with the raster map. Compared with SIFT, the FFM algorithm is more stable when matching multi-source remote sensing images with large gray-level differences and inconsistent local gradient information. The SuperPoint algorithm improves considerably on SIFT and ContextDesc in the number of correct matching points, matching-point RMSE and time, indicating stronger adaptability to multi-source remote sensing images, but its performance is generally below that of the FFM algorithm. The ContextDesc algorithm fuses multiple features for matching, yet its matching results are relatively poor, indicating insufficient adaptability to multi-source remote sensing images with larger differences; it fails to match the panchromatic image with the SAR image and the optical image with the raster map, which shows poor resistance to the nonlinear radiometric distortion and local gradient differences of multi-source remote sensing images.
In terms of matching-point RMSE, the LoFTR and SuperPoint algorithms perform well compared with SIFT and ContextDesc, but still fall short of the FFM algorithm, particularly on the panchromatic and SAR images; this comparison shows that the feature localization accuracy of the FFM algorithm is higher. In terms of time, the FFM algorithm is slower than SuperPoint and LoFTR because of the two additional stages of sliding-window search detection and matching verification applied to the initial matches. The RMSE values obtained by the FFM algorithm on the six image groups differ somewhat: compared with the third group, the first and second groups contain more buildings, and because buildings have different projection parallaxes on different images, the building regions exhibit larger local deformation between images; such local geometric deformation is difficult to remove with an affine transformation model, so the RMSE of those matching results is relatively large.
For the distribution uniformity of the matching points, the comparison focuses on the FFM and LoFTR algorithms, since the other algorithms produce too few correct matching points; the comparison results are shown in Table 3.
TABLE 3 comparison of uniformity of distribution of matching points
(The body of Table 3 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 3, the matching-point uniformity of the FFM algorithm is greater than that of LoFTR on all six multi-source remote sensing image pairs. Since the uniformity measure applies a logarithmic operation and thus reflects the variance of the five-direction distribution, the higher values show that the matching-point uniformity of FFM is clearly superior to that of LoFTR, and the experiment confirms the effectiveness of the sliding-window adaptive score detection algorithm in detecting features in sparsely matched regions. After the fine matching is completed in this comparative example, the matching results are shown in figs. 8a-8f, where fig. 8a shows the fine matching result of the first image pair, fig. 8b of the second, fig. 8c of the third, fig. 8d of the fourth, fig. 8e of the fifth, and fig. 8f of the sixth.
The FFM algorithm therefore adapts well to multi-source remote sensing images, obtains a considerable number of matching point pairs, and distributes the feature points uniformly. The small number of remaining mismatched points is removed with the PROSAC algorithm to purify the matching point pairs; the purification results are shown in figs. 9a-9f, where fig. 9a shows the purification result of the first image pair, fig. 9b of the second, fig. 9c of the third, fig. 9d of the fourth, fig. 9e of the fifth, and fig. 9f of the sixth.
The PROSAC algorithm thus eliminates mismatched point pairs effectively, and the matching point pairs finally retained remain largely uniformly distributed, laying a good foundation for subsequent image registration, fusion and related work.
To show the performance of the algorithms more intuitively, figs. 10a-10e show the matching results of the LoFTR, SuperPoint, SIFT, ContextDesc and FFM algorithms on the first, third and fourth multi-source remote sensing image pairs after purification with the PROSAC algorithm: fig. 10a corresponds to LoFTR, fig. 10b to SuperPoint, fig. 10c to SIFT, fig. 10d to ContextDesc, and fig. 10e to FFM.
The results in figs. 10a-10e show that, for the optical/thermal-infrared and optical/panchromatic image pairs, FFM overcomes the matching difficulties caused by gray-gradient and scale differences better than SuperPoint, ContextDesc and SIFT, and obtains a considerable number of correct matching point pairs. For optical images of different time phases, FFM has a clear advantage in areas with vegetation differences, benefiting from learning the ground-truth relationships among feature vectors during training.
Multi-source remote sensing image registration is one of the main purposes of image matching, so the quality of the registration result is an intuitive indicator of the quality of the matching result. The purified matching point pairs are used in a multi-source image registration test: affine transformation parameters are computed from the matching point pairs to rectify and register the multi-source images. The final registration results, with locally windowed enlargements, are shown in figs. 11a-11f, where fig. 11a shows the result for the first image pair, fig. 11b for the second, fig. 11c for the third, fig. 11d for the fourth, fig. 11e for the fifth, and fig. 11f for the sixth.
As can be seen from figs. 11a-11f, the FFM algorithm adapts well to the registration of visible-light with thermal infrared images and of panchromatic with SAR images. Accurate registration is achieved even in local areas with large gray-level differences and obvious differences in ground objects, and the registration error of each area is essentially kept within 3 pixels. The registration results show that the matches produced by the FFM algorithm have high positional accuracy and uniform distribution, demonstrating strong performance.
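As a rough illustration of this registration step, the purified matching point pairs can be used to estimate affine parameters and resample one image onto the other, for example with OpenCV; this is a minimal sketch under assumed function and parameter choices, not the exact implementation used in the test:

```python
import cv2
import numpy as np

def register_with_affine(img_to_warp, ref_img, pts_src, pts_ref):
    """Estimate an affine model from purified matches and warp one image onto the other.

    pts_src: (N, 2) matched points on the image to be warped.
    pts_ref: (N, 2) corresponding points on the reference image.
    """
    # Robustly fit the 2x3 affine transformation from the matched point pairs.
    A, inliers = cv2.estimateAffine2D(pts_src.astype(np.float32),
                                      pts_ref.astype(np.float32),
                                      method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    # Resample the source image into the reference image's pixel grid.
    h, w = ref_img.shape[:2]
    registered = cv2.warpAffine(img_to_warp, A, (w, h))
    return registered, A
```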
In the method, during the early feature extraction stage the fine fusion features obtained through feature extraction and fusion carry both high-level and low-level information, which guarantees localization accuracy as well as global awareness and interference resistance; in addition, feature transformation fusion is applied to the extracted high-level features, which raises the similarity between matching points in the high-level features of the images to be matched and makes the matching result more reliable. In the later feature matching stage, the high-level fusion features are first compared in a coarse matching step to obtain a result that better reflects the global characteristics, and the fine fusion features are then compared to correct that result, so the matching result is reliable both globally and in terms of precision, with higher resolution and more accurate localization. Before feature transformation fusion, sinusoidal encoding is applied to all high-level feature vectors so that features at different positions have a unique, determined correspondence, avoiding mismatches caused by excessive similarity between feature vectors in sparsely textured image regions; during coarse matching, the sliding window adaptively lowers the score threshold, which increases the number of matching points retained in sparse regions, so the matching quality of sparse regions is improved from both aspects.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (9)

1. A multi-source remote sensing image depth feature fusion matching method is characterized by comprising the following steps:
1) Constructing a matching model, wherein the matching model comprises a feature extraction network, a feature transformation module, a dense matching module and a calibration optimization module;
2) Inputting the obtained remote sensing image pairs to be matched into a feature extraction network, the feature extraction network extracting features from each image to be matched respectively to obtain the high-level features of each image to be matched and fine fusion features that fuse the fine positioning information of the high-level features with the global information of the low-level features;
3) Simultaneously inputting the high-level features of the obtained image pair to be matched into a feature transformation module, and performing feature transformation fusion on each image to be matched by fusing neighborhood information of the image and the high-level features of the image to be matched to obtain the high-level fusion features with correlation of each image to be matched in the image pair to be matched;
4) Carrying out dense matching on all the feature vectors on the high-level fusion features of the obtained image pair to be matched, and obtaining a coarse matching result according to the similarity between the feature vectors;
5) And mapping the coarse matching result to the fine fusion characteristic to carry out calibration optimization on the dense matching to obtain a fine matching result.
2. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein the feature extraction network comprises three down-sampling layers and two up-sampling layers;
the first down-sampling layer is used for down-sampling the input image to be matched to obtain a high-level feature map with the original dimension of 1/2 of the image to be matched; the second down-sampling layer is used for down-sampling the input high-level feature map of the original dimension 1/2 of the image to be matched to obtain the high-level feature map of the original dimension 1/4 of the image to be matched; the third down-sampling layer is used for down-sampling the input high-level feature map with the original dimensionality of 1/4 of the image to be matched to obtain the high-level feature map with the original dimensionality of 1/8 of the image to be matched;
the first up-sampling layer is used for up-sampling an input high-level feature map with original dimension 1/8 of an image to be matched into a low-level feature map with original dimension 1/4 of the image to be matched, meanwhile, performing convolution processing on the high-level feature map with original dimension 1/4, and then fusing the low-level feature map with original dimension 1/4 and the high-level feature map with original dimension 1/4 to obtain a fused feature map with original dimension 1/4; the second upsampling layer is used for upsampling the input fusion feature map with the original dimension of 1/4 into a low-level feature map with the original dimension of 1/2, simultaneously performing convolution processing on the high-level feature map with the original dimension of 1/2, and then fusing the low-level feature map with the original dimension of 1/2 and the high-level feature map with the original dimension of 1/2 to obtain a fine fusion feature map with the original dimension of 1/2;
and outputting a high-level feature map of the original dimension 1/8 and a fine fusion feature map of the original dimension 1/2 of each image to be matched through the three down-sampling layers and the two up-sampling layers for fusion feature matching.
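A minimal PyTorch sketch of a feature extraction network of this shape is given below, assuming single-channel input, simple strided-convolution blocks for down-sampling, and bilinear up-sampling with 1×1 convolutions for fusion; the channel widths are illustrative assumptions and the residual (element superposition) structure of claim 4 is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Three down-sampling layers (1/2, 1/4, 1/8) and two up-sampling layers (1/4, 1/2)."""
    def __init__(self, c1=64, c2=128, c3=256):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(1, c1, 3, stride=2, padding=1), nn.ReLU())   # -> 1/2
        self.down2 = nn.Sequential(nn.Conv2d(c1, c2, 3, stride=2, padding=1), nn.ReLU())  # -> 1/4
        self.down3 = nn.Sequential(nn.Conv2d(c2, c3, 3, stride=2, padding=1), nn.ReLU())  # -> 1/8
        self.lat2 = nn.Conv2d(c2, c2, 1)      # convolution applied to the 1/4 high-level map
        self.lat1 = nn.Conv2d(c1, c1, 1)      # convolution applied to the 1/2 high-level map
        self.proj3to2 = nn.Conv2d(c3, c2, 1)  # channel projections before up-sampling
        self.proj2to1 = nn.Conv2d(c2, c1, 1)

    def forward(self, x):
        f2 = self.down1(x)        # 1/2 high-level feature map
        f4 = self.down2(f2)       # 1/4 high-level feature map
        f8 = self.down3(f4)       # 1/8 high-level feature map (coarse features)
        # First up-sampling layer: 1/8 -> 1/4, fused with the convolved 1/4 map.
        u4 = F.interpolate(self.proj3to2(f8), scale_factor=2, mode="bilinear", align_corners=False)
        fused4 = u4 + self.lat2(f4)
        # Second up-sampling layer: 1/4 -> 1/2, fused with the convolved 1/2 map.
        u2 = F.interpolate(self.proj2to1(fused4), scale_factor=2, mode="bilinear", align_corners=False)
        fine2 = u2 + self.lat1(f2)            # fine fusion feature map at 1/2 resolution
        return f8, fine2
```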
3. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein the fusion feature matching of the feature transformation module specifically comprises the following steps:
(1) adding position information to a high-level feature map with 1/8 of the original dimension of the image to be matched, and enabling the feature to be uniquely corresponding to the position of the feature on the original image;
(2) flattening a high-level feature map with position information of an image to be matched into a one-dimensional vector, inputting the one-dimensional vector to a feature transformation module, performing multiple interweaving processing through a concerned layer, and outputting high-level fusion features with correlation of each image to be matched;
the attention layers comprise a self-attention layer, a cross-attention layer and an attention layer; the self-attention layer fuses, for the reference image in the input image pair to be matched, the position-encoded features with their own local neighborhood information to generate a new feature map; the cross-attention layer fuses the position-encoded features of the reference image with the features of the other image to be matched in the image pair; the attention layer selects the feature information of the vectors with high similarity by comparing the similarity between the input query vector and each key feature, and the selected result, after normalization, is superimposed on the flattened one-dimensional vector of the position-encoded high-level feature map of each image to be matched, to obtain fusion features that fuse the position information, neighborhood information and image information of the image to be matched; and the multiple interweaving processing means that the obtained fusion features are input to the attention layer for interweaving processing again, the process is repeated, and the high-level fusion features of the images to be matched are finally output.
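The interweaved attention processing can be sketched as follows, assuming standard scaled dot-product attention over the flattened, position-encoded feature vectors; the layer count, dimensions and weight sharing between the two images are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Scaled dot-product attention whose normalized output is added back to the query features."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, source_feats):
        # Select information from source_feats according to similarity with query_feats,
        # normalize it, and superimpose it on the flattened query feature vectors.
        selected, _ = self.attn(query_feats, source_feats, source_feats)
        return query_feats + self.norm(selected)

class FeatureTransformer(nn.Module):
    """Alternating self- and cross-attention producing correlated high-level fusion features."""
    def __init__(self, dim=256, num_rounds=4):
        super().__init__()
        self.self_layers = nn.ModuleList(AttentionLayer(dim) for _ in range(num_rounds))
        self.cross_layers = nn.ModuleList(AttentionLayer(dim) for _ in range(num_rounds))

    def forward(self, feats_a, feats_b):
        for self_l, cross_l in zip(self.self_layers, self.cross_layers):
            feats_a = self_l(feats_a, feats_a)   # self-attention: fuse local neighborhood info
            feats_b = self_l(feats_b, feats_b)
            feats_a = cross_l(feats_a, feats_b)  # cross-attention: fuse info from the other image
            feats_b = cross_l(feats_b, feats_a)
        return feats_a, feats_b
```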
4. The multi-source remote sensing image depth feature fusion matching method according to claim 2, wherein the three downsampling layers all adopt a convolution block structure, element superposition is carried out on the input feature map and the output feature map, and the feature map obtained after superposition is used as a downsampling result.
5. The multi-source remote sensing image depth feature fusion matching method according to claim 3, wherein the adding position information is adding sinusoidal codes to each pixel feature.
6. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein the coarse matching represents the similarity between all feature vectors through a score matrix between the high-level fusion feature vectors, and a pair whose similarity is greater than a score threshold is regarded as a correct match;
the score matrix S between the vectors is determined by the following formula, where $\langle \cdot,\cdot \rangle$ denotes the inner product:
$S(i, j) = \langle F_{A\_tr}(i),\, F_{B\_tr}(j) \rangle,\ \forall (i, j) \in A \times B$
wherein $F_{A\_tr}$ and $F_{B\_tr}$ are the high-level fusion features of the image A and the image B to be matched, and $A \times B$ denotes all possible correspondences between pixels of the images A and B to be matched;
a score matrix is calculated for all possible matching modes, and the optimal assignment matrix P is obtained by maximizing the total score $\sum_{i,j} S_{i,j} P_{i,j}$; the optimal assignment matrix P represents the optimal correspondence between the high-level fusion feature vectors in the image A to be matched and those in the image B to be matched.
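A minimal sketch of this coarse-matching score computation is shown below; mutual-nearest-neighbour selection is used here merely as one simple way of approximating the assignment that maximizes the total score, and the threshold value is an illustrative assumption:

```python
import torch

def coarse_match(feat_a, feat_b, score_threshold=0.2):
    """feat_a: (Na, d) and feat_b: (Nb, d) high-level fusion feature vectors."""
    S = feat_a @ feat_b.t()              # score matrix of inner products over A x B
    # Keep pairs that are mutual best matches and exceed the score threshold.
    best_b = S.argmax(dim=1)             # best column for each row of A
    best_a = S.argmax(dim=0)             # best row for each column of B
    rows = torch.arange(S.shape[0])
    mutual = best_a[best_b] == rows
    keep = mutual & (S[rows, best_b] > score_threshold)
    return rows[keep], best_b[keep], S
```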
7. The multi-source remote sensing image depth feature fusion matching method according to claim 6, wherein the coarse matching through the score matrix adopts a sliding window adaptive score threshold detection algorithm, and specifically comprises the following steps:
a) Setting an initial score threshold value as theta, setting the area of a sliding window, a horizontal sliding step length and a vertical sliding step length, and performing sliding detection on the score of the high-level fusion feature vector;
b) If the scores s of all high-level fusion feature vectors in the current window are less than theta, calculating an adaptive threshold avg theta for the matched-sparse region within the window; traversing the high-level fusion feature vectors in the window, and if a vector in the current window has a score s greater than avg theta, adding the vector to the coarse matching point set and continuing to slide the window;
c) If the score s of the feature vector existing in the current window is larger than or equal to theta, the window continues to slide;
d) Repeating the steps until the window slides and traverses the high-level fusion features with correlation of the images to be matched, and outputting a coarse matching point set;
the area ws of the sliding window, the horizontal sliding step length hl and the vertical sliding step length vl are calculated by formulas that are provided as images in the original publication;
the calculation formula of the adaptive threshold avg theta for the matched-sparse region is as follows:
$avg\theta = \dfrac{1}{n}\sum_{i=1}^{n} s_i$
wherein n is the number of feature vectors in the sliding window and $s_i$ is the matching score of the i-th feature vector in the sliding window.
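The sliding-window adaptive score threshold detection can be sketched as follows; the window geometry (ws, hl, vl) is taken as given, ws is treated as a window side length, and the adaptive threshold in a matched-sparse window is taken as the mean score of the feature vectors in that window, all of which are simplifying assumptions for illustration:

```python
import numpy as np

def sliding_window_detect(score_map, theta, ws, hl, vl):
    """Adaptive score-threshold detection on a 2-D grid of matching scores.

    score_map: (H, W) best matching score of each high-level fusion feature vector.
    theta: initial score threshold; ws: window size; hl, vl: horizontal/vertical steps.
    """
    H, W = score_map.shape
    keep = np.zeros_like(score_map, dtype=bool)
    for top in range(0, max(H - ws + 1, 1), vl):
        for left in range(0, max(W - ws + 1, 1), hl):
            window = score_map[top:top + ws, left:left + ws]
            if (window >= theta).any():
                # At least one score reaches theta: keep those points, keep sliding.
                keep[top:top + ws, left:left + ws] |= window >= theta
            else:
                # Matched-sparse window: lower the threshold to the window's mean score.
                avg_theta = window.mean()
                keep[top:top + ws, left:left + ws] |= window > avg_theta
    return np.argwhere(keep)   # coarse matching point set as (row, col) indices
```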
8. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein the process of mapping the coarse matching result to the low-level features to perform calibration optimization on the dense matching is as follows:
i) Taking N pairs of feature points on the high-level fusion features of each image to be matched as a center, wherein the N pairs of feature points refer to a coarse matching point set screened out after coarse matching; respectively cutting N pairs of local windows with the size of m multiplied by m on the corresponding high-level fusion characteristics;
II) mapping the N pairs of windows onto the fine fusion features of the images to be matched to obtain N pairs of local fine windows centered on the coarse-matching feature point pairs, inputting the N pairs of local fine windows into the feature transformation module, and transforming them several times to generate N pairs of local fine fusion feature maps $\hat F_A$ and $\hat F_B$ of the images A and B centered on the coarse-matching feature point pairs;
III) correlating the feature vector corresponding to the center point P of each $\hat F_A$ with all vectors in the corresponding $\hat F_B$ to generate the expected value of the matching probability distribution of each pixel of $\hat F_B$ with respect to P; the expected value of the matching probability distribution is calculated as follows:
$\hat y_P = \sum_{x \in \hat F_B} \operatorname{softmax}\big(\langle V_A(P),\, V_B(x) \rangle\big)_x \cdot y_x$
wherein $V_A(P)$ is the feature vector of the center point P of $\hat F_A$, $V_B(x)$ is the feature vector of a pixel point x of $\hat F_B$, and $y_x$ is the pixel coordinate of the pixel point x on the image B; the pixel point with the highest calculated probability value is the fine matching result, with sub-pixel precision, of the point P of image A on image B, and this result is taken as the final matching result.
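The calibration optimization on the fine fusion features can be sketched as follows, assuming a softmax-weighted correlation followed by an expectation over pixel coordinates; since the exact formula appears only as an image in the original publication, this form is an assumption for illustration:

```python
import torch

def fine_refine(center_vec, local_window_feats, coords):
    """Sub-pixel refinement of one coarse match.

    center_vec: (d,) feature vector of the center point P on the local window of image A.
    local_window_feats: (m*m, d) feature vectors of the local fine window on image B.
    coords: (m*m, 2) float pixel coordinates of those vectors on image B.
    """
    sim = local_window_feats @ center_vec               # correlation with every pixel in the window
    prob = torch.softmax(sim, dim=0)                    # matching probability distribution
    expected = (prob.unsqueeze(1) * coords).sum(dim=0)  # expected (sub-pixel) position on image B
    return expected
```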
9. The multi-source remote sensing image depth feature fusion matching method according to claim 1, wherein after the fine matching result is obtained, a PROSAC algorithm is further adopted to perform mismatching check and elimination on the fine matching result again.
CN202210792899.XA 2022-07-05 2022-07-05 Multi-source remote sensing image depth feature fusion matching method Pending CN115240079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210792899.XA CN115240079A (en) 2022-07-05 2022-07-05 Multi-source remote sensing image depth feature fusion matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210792899.XA CN115240079A (en) 2022-07-05 2022-07-05 Multi-source remote sensing image depth feature fusion matching method

Publications (1)

Publication Number Publication Date
CN115240079A true CN115240079A (en) 2022-10-25

Family

ID=83671142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210792899.XA Pending CN115240079A (en) 2022-07-05 2022-07-05 Multi-source remote sensing image depth feature fusion matching method

Country Status (1)

Country Link
CN (1) CN115240079A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117078982A (en) * 2023-10-16 2023-11-17 山东建筑大学 Deep learning-based large-dip-angle stereoscopic image alignment dense feature matching method
CN117078982B (en) * 2023-10-16 2024-01-26 山东建筑大学 Deep learning-based large-dip-angle stereoscopic image alignment dense feature matching method
CN117422746A (en) * 2023-10-23 2024-01-19 武汉珈和科技有限公司 Partition nonlinear geographic registration method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114782691A (en) Robot target identification and motion detection method based on deep learning, storage medium and equipment
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN115240079A (en) Multi-source remote sensing image depth feature fusion matching method
CN111652273B (en) Deep learning-based RGB-D image classification method
CN112907602A (en) Three-dimensional scene point cloud segmentation method based on improved K-nearest neighbor algorithm
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN114758337A (en) Semantic instance reconstruction method, device, equipment and medium
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN116385707A (en) Deep learning scene recognition method based on multi-scale features and feature enhancement
CN117830788B (en) Image target detection method for multi-source information fusion
CN116310098A (en) Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network
CN115880553A (en) Multi-scale change target retrieval method based on space-time modeling
CN114782503A (en) Point cloud registration method and system based on multi-scale feature similarity constraint
CN110956601A (en) Infrared image fusion method and device based on multi-sensor mode coefficients and computer readable storage medium
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN112686830B (en) Super-resolution method of single depth map based on image decomposition
CN111597367A (en) Three-dimensional model retrieval method based on view and Hash algorithm
CN113920587B (en) Human body posture estimation method based on convolutional neural network
CN113628111B (en) Hyperspectral image super-resolution method based on gradient information constraint
CN115496788A (en) Deep completion method using airspace propagation post-processing module
CN115631513A (en) Multi-scale pedestrian re-identification method based on Transformer
CN114708315A (en) Point cloud registration method and system based on depth virtual corresponding point generation
CN114972937A (en) Feature point detection and descriptor generation method based on deep learning
CN117095033B (en) Multi-mode point cloud registration method based on image and geometric information guidance
CN116091787B (en) Small sample target detection method based on feature filtering and feature alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination