CN113326856B - Self-adaptive two-stage feature point matching method based on matching difficulty - Google Patents


Info

Publication number
CN113326856B
CN113326856B (application CN202110884790.4A)
Authority
CN
China
Prior art keywords
picture
matching
feature point
layer
descriptor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110884790.4A
Other languages
Chinese (zh)
Other versions
CN113326856A (en)
Inventor
周军
黄坤
刘野
李静远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110884790.4A priority Critical patent/CN113326856B/en
Publication of CN113326856A publication Critical patent/CN113326856A/en
Application granted granted Critical
Publication of CN113326856B publication Critical patent/CN113326856B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a self-adaptive two-stage feature point matching method based on matching difficulty, belonging to the technical field of image processing. The technical scheme of the invention is as follows: first, a subsequent specific matching mode is selected based on the picture difference degree between the two pictures of a pair; if the picture difference degree is small, matching is performed directly based on the Euclidean distance between the descriptors of the feature points. Otherwise, the position information of the feature points of each picture is raised in dimension so that it is consistent with the dimension of the descriptor, the descriptor and the dimension-raised position information are added to obtain a new descriptor for each feature point, attention aggregation is performed to obtain a matching descriptor for each feature point, and matching is performed based on the inner product between the matching descriptors. The method is used for matching the feature points of a picture pair, realizes self-adaptive two-stage feature point matching, and improves the matching accuracy and the processing efficiency.

Description

Self-adaptive two-stage feature point matching method based on matching difficulty
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a self-adaptive two-stage feature point matching method based on matching difficulty.
Background
Feature point matching refers to correctly matching the feature point sets extracted from different pictures, and it has important applications in geometry-based computer vision tasks. For example, simultaneous localization and mapping is a key technology of unmanned driving: feature points are extracted from and matched between the picture pairs shot by the camera, and the position of the robot or vehicle at each moment can be calculated from the matching relation. A good matching result not only helps improve the accuracy of the subsequently calculated position, but also provides a good initial value for subsequent iterative algorithms and helps them reach the optimal result as soon as possible.
Early feature point matching mainly used the nearest neighbor matching algorithm. When feature point extraction is completed, a vector describing the surrounding information is obtained; this vector is called a descriptor. The nearest neighbor matching algorithm calculates the Euclidean distance between descriptors as the measurement standard, and feature points with a smaller distance are more likely to match.
Although nearest neighbor matching performs well in some simple environments, it is difficult to achieve satisfactory results in difficult scenes (e.g., blur, occlusion, large perspective transformation). So in recent years, with the excellent performance of neural networks in image processing, matching with neural networks has emerged.
Completing the matching task with neural network algorithms has brought great progress in accuracy, but also an attendant increase in computational effort, which makes it difficult to apply feature point matching to real-time applications such as simultaneous localization and mapping.
Disclosure of Invention
The embodiment of the invention provides a self-adaptive two-stage feature point matching method based on matching difficulty, which is used for improving the accuracy of picture feature point matching.
The embodiment of the invention provides a self-adaptive two-stage feature point matching method based on matching difficulty, which comprises the following steps:
step 1, inputting a picture pair to be matched, wherein the picture information of the input picture pair comprises brightness information of a picture and characteristic point information of the picture, the characteristic point information comprises position information and a descriptor, and the position information comprises a spatial position coordinate and a confidence value of a characteristic point;
step 2, calculating the picture difference degree between the picture pairs to be matched based on the brightness information of the pictures, if the picture difference degree is greater than or equal to a difference threshold value, executing the steps 3 to 5, otherwise executing the step 6;
step 3, performing dimension increasing processing on the position information of the feature points of each picture to enable the dimension of the position information of the feature points after dimension increasing to be consistent with the dimension of the descriptor, and adding the descriptor and the position information after dimension increasing to obtain a new descriptor of each feature point;
step 4, carrying out attention aggregation processing on the new descriptor of each feature point to obtain a matching descriptor of each feature point;
step 5, calculating a matching result in an inner product mode:
respectively defining two pictures of the picture pair as a first picture and a second picture;
traversing each feature point of the first picture, calculating the matching degree between each current feature point of the first picture and the matching descriptor of each feature point in the second picture by adopting an inner product mode, and if the maximum matching degree is larger than or equal to a first matching threshold value, taking the feature point of the second picture corresponding to the maximum matching degree as the matching result of the current feature point of the first picture;
step 6, calculating a matching result in an Euclidean distance mode:
respectively defining two pictures of the picture pair as a first picture and a second picture;
and traversing each feature point of the first picture, calculating the matching degree between each current feature point of the first picture and the descriptor of each feature point in the second picture by adopting an Euclidean distance mode, and if the minimum matching degree is less than or equal to a second matching threshold, taking the feature point of the second picture corresponding to the minimum matching degree as the matching result of the current feature point of the first picture.
In the embodiment of the invention, the subsequent specific matching mode is selected according to the picture difference degree between the two pictures of the pair. If the picture difference degree is small, i.e. smaller than the specified difference threshold, matching is performed directly based on the descriptors of the feature points; otherwise, after a series of calculations, a matching descriptor that better characterizes each feature point is obtained, the inner product between two matching descriptors is used as the calculated value of their matching degree, and the object with the largest matching degree that is also larger than the first matching threshold is taken as the final matching result, so as to improve the matching accuracy. In other words, the embodiment of the present invention realizes adaptive two-stage feature point matching: the picture difference detection decides whether to enable the first stage, which consists of the key point aggregation (step 3) and the attention aggregation (step 4), before the second-stage matching calculation (step 5 or step 6), thereby improving the matching accuracy and the processing efficiency.
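To make the adaptive two-stage flow concrete, the following Python sketch (illustrative only, not part of the claimed method) shows the dispatch between the two matching modes; the first-stage aggregation of steps 3 to 4 is deliberately left as a placeholder here and sketched separately in the detailed description, and all function names and threshold values are assumptions.

```python
import numpy as np

def adaptive_two_stage_match(bright_a, bright_b, desc_a, desc_b,
                             diff_threshold=5e5, threshold1=10.0, threshold2=0.9):
    """Illustrative dispatch for steps 2, 5 and 6.
    bright_*: equally sized grayscale brightness images;
    desc_*: (N, 256) descriptor matrices of the two pictures."""
    # Step 2: picture difference degree (sum of absolute brightness errors).
    dscore = np.abs(bright_a.astype(float) - bright_b.astype(float)).sum()
    if dscore >= diff_threshold:
        # Steps 3-4 (placeholder): key point aggregation and attention
        # aggregation would turn desc_* into matching descriptors here.
        f_a, f_b = desc_a, desc_b
        # Step 5: inner-product matching with the first matching threshold.
        scores = f_a @ f_b.T
        best = scores.argmax(axis=1)
        return [(i, int(j)) for i, j in enumerate(best) if scores[i, j] >= threshold1]
    # Step 6: Euclidean-distance matching on the input descriptors.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    best = dists.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(best) if dists[i, j] <= threshold2]
```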
Further, in step 2, the picture difference degree between the picture pair to be matched is calculated based on the brightness information of the pictures as follows: size normalization processing is performed on the two pictures of the picture pair, and the picture difference degree is calculated as the sum of absolute errors of the brightness information.
Further, in step 3, the dimension-raising processing of the position information of the feature points of each picture is performed as follows: the position information of the feature points of each picture is raised in dimension by a multilayer perceptron.
Further, in step 3, the multilayer perceptron is: defining L to represent the number of network layers of the multilayer perceptron, the first L-1 layers are a stacked structure of L-1 first convolution blocks and the L-th layer is an addition layer; the input of the 1st first convolution block is the position information of all feature points of a picture, the number of channels of the feature map output by the (L-1)-th first convolution block is the same as the dimension of the descriptor, the input of the addition layer is the descriptors of all feature points of the picture and the feature map output by the (L-1)-th first convolution block, and a first convolution block comprises a convolution layer, a batch normalization layer and an activation function ReLu layer which are connected in sequence.
Further, in step 4, the attention aggregation processing on the new descriptor of each feature point is as follows:
an attention aggregation process is performed on the new descriptors of the feature points by using a graph network with L_G layers, where L_G is an odd number greater than 1;
the first L_G-1 layers of the graph network form an alternating layer structure of self layers and cross layers, and the last layer is a fully connected layer;
the self layer and the cross layer have the same network structure and are both neural network layers with attention mechanisms, the input of the self layer is different feature points of the same picture of the picture pair to be matched, and the input of the cross layer is different feature points of the two pictures of the picture pair to be matched.
Further, in step 4, the network structure of each of the first L_G-1 layers of the graph network is two stacked second convolution blocks; along the forward propagation direction, the 1st second convolution block of each layer has 2M input channels and 2M output channels, with a convolution kernel size of 1 × 2M × 2M; the 2nd second convolution block of each layer has 2M input channels and M output channels, with a convolution kernel size of 1 × 2M × M, where M denotes the dimension of a descriptor and a second convolution block comprises a convolution layer and an activation function ReLu layer connected in sequence.
Further, in step 5, the value range of the first matching threshold is set to 9-11.
Further, in step 6, the value range of the second matching threshold is set to be 0.8-1.
The technical scheme provided by the embodiment of the invention at least has the following beneficial effects:
(1) According to the embodiment of the invention, by adding picture difference degree detection, the matching mode can be flexibly selected according to the matching difficulty, so that speed is increased as much as possible while accuracy is ensured. Compared with the traditional neural network processing scheme, the speed is obviously improved in a mixed environment (a mixture of complex and simple environments).
(2) Compared with the traditional simple matching processing scheme, the embodiment of the invention uses the two-stage neural network, and ensures that higher matching accuracy can be achieved under complex conditions.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of an adaptive two-stage feature point matching method based on matching difficulty according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a judgment module adopted in a two-stage adaptive feature point matching method based on matching difficulty according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an aggregation module used in a matching difficulty-based adaptive two-stage feature point matching method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Feature point matching plays a very important role in many geometry-based computer vision tasks, such as three-dimensional reconstruction and simultaneous localization and mapping. The embodiment of the invention provides a feature point matching method that can adaptively start a two-stage neural network according to the matching difficulty; compared with the traditional matching scheme, the matching accuracy is greatly improved in various environments, and in a simple environment the speed is greatly improved compared with a neural-network-based matching scheme.
Referring to fig. 1, an adaptive two-stage feature point matching method based on matching difficulty provided in an embodiment of the present invention includes the following steps:
step 1, inputting a picture pair to be matched, wherein the input picture information comprises brightness information of a picture and characteristic point information of the picture, and the characteristic point information comprises position information (space position coordinates and confidence values) and a descriptor;
step 2, detecting the picture difference: calculating the picture difference degree between the picture pairs to be matched based on the brightness information of the pictures, if the picture difference degree is larger than or equal to a difference threshold value, executing the steps 3 to 5, otherwise executing the step 6;
step 3, key point aggregation processing: performing dimension-raising processing on the position information of the feature points of each picture so that the dimension of the dimension-raised position information is consistent with the dimension of the descriptor, and adding the descriptor and the dimension-raised position information to obtain a new descriptor of each feature point;
step 4, attention aggregation processing: carrying out attention aggregation processing on the new descriptor of each feature point to obtain a matching descriptor of each feature point;
step 5, calculating a matching result in an inner product mode:
respectively defining two pictures of the picture pair as a first picture and a second picture;
traversing each feature point of the first picture, calculating the matching degree between each current feature point of the first picture and the matching descriptor of each feature point in the second picture by adopting an inner product mode, and if the maximum matching degree is larger than or equal to a first matching threshold value, taking the feature point of the second picture corresponding to the maximum matching degree as the matching result of the current feature point of the first picture;
step 6, calculating a matching result in an Euclidean distance mode:
respectively defining two pictures of the picture pair as a first picture and a second picture;
and traversing each feature point of the first picture, calculating the matching degree between each current feature point of the first picture and the descriptor of each feature point in the second picture by adopting an Euclidean distance mode, and if the minimum matching degree is less than or equal to a second matching threshold, taking the feature point of the second picture corresponding to the minimum matching degree as the matching result of the current feature point of the first picture.
In the embodiment of the present invention, the relevant processing in steps 2 to 4 is implemented by using a neural network, and the overall processing of the embodiment can be divided into two parts: a judging module and a two-stage module. The two-stage module specifically comprises a first-stage aggregation module and a second-stage calculation module, and the second-stage calculation module comprises an inner product calculation module and a Euclidean distance calculation module.
The two pictures to be matched (the picture pair to be matched) are first subjected to picture difference degree detection by the judging module; if the picture difference degree reaches the specified difference threshold, the first-stage aggregation module and the second-stage inner product calculation module are started, otherwise only the Euclidean distance calculation module of the second stage is started, and the matching result is finally obtained from the output of the second-stage calculation module.
In a possible implementation manner, the judging module is mainly used for judging the difficulty of matching. The matching difficulty is mainly determined by the difference between the pictures: when the difference between the two pictures is small, the feature points do not change greatly and matching is easy; when the difference between the two pictures is large, for example under drastic illumination change, large viewpoint change, blur or occlusion, the feature points change greatly and matching is difficult. Referring to fig. 2, the judging module includes a normalization module (reshape module) and a difference judging module (Sum of Absolute Differences module). That is, the judging module first uses the reshape module to bring the two pictures to the same size, i.e. performs size normalization, and then uses the sum of absolute differences (SAD) algorithm to judge the picture difference. In this way the picture difference degree can be judged simply and with a high degree of parallelism on a GPU (graphics processing unit), so no large amount of time is wasted. The specific calculation formula is as follows.
For the picture pair to be matched, the two pictures are denoted $I_A$ and $I_B$, and $I_A(i,j)$, $I_B(i,j)$ denote the pixel brightness values of $I_A$ and $I_B$ at pixel point $(i,j)$. After the normalization module, both pictures are resized to $W \times H$, where $W$ is the width and $H$ the height. Dscore denotes the difference degree of the two pictures and is calculated by formula (1):

$$\mathrm{Dscore} = \sum_{i=1}^{W} \sum_{j=1}^{H} \left| I_A(i,j) - I_B(i,j) \right| \qquad (1)$$
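As an illustration only (not part of the patent text), the following NumPy sketch computes Dscore as in formula (1); the normalized size W × H and the nearest-neighbour resampling are assumptions made for the example.

```python
import numpy as np

def picture_difference(img_a: np.ndarray, img_b: np.ndarray,
                       w: int = 160, h: int = 120) -> float:
    """Dscore from formula (1): sum of absolute brightness differences after
    normalizing both pictures to the same W x H size (nearest-neighbour
    resampling is used here purely for illustration)."""
    def resize(img: np.ndarray) -> np.ndarray:
        rows = np.linspace(0, img.shape[0] - 1, h).round().astype(int)
        cols = np.linspace(0, img.shape[1] - 1, w).round().astype(int)
        return img[rows][:, cols].astype(np.float32)
    return float(np.abs(resize(img_a) - resize(img_b)).sum())

# Example: two random grayscale pictures of different sizes.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(480, 640))
b = rng.integers(0, 256, size=(360, 480))
print(picture_difference(a, b))
```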
Referring to fig. 3, in one possible implementation, the aggregation module includes two parts, a keypoint aggregation and an attention aggregation.
The number of feature points of picture A is denoted $N_A$ and that of picture B is denoted $N_B$; the feature points may be obtained by any conventional feature point extraction method applied to the image. $d_i^A$ and $d_i^B$ denote the descriptors of the $i$-th input feature point of picture A and picture B respectively; the descriptor dimension is assumed fixed and is 256 in this embodiment. $p_i^A$ and $p_i^B$ denote the position information of the $i$-th input feature point of picture A and picture B respectively; its dimension is 3, consisting of the spatial coordinates $(x, y)$ and the confidence of the feature point. $D^A = \{d_i^A\}_{i=1}^{N_A}$ denotes the set of all input descriptors of picture A and $D^B = \{d_i^B\}_{i=1}^{N_B}$ the set of all input descriptors of picture B.

Key point aggregation fuses the position information of the feature points with the descriptors: the position information is raised in dimension by a multi-layer perceptron (MLP) and added to the descriptor to obtain a new descriptor, which is then used in the subsequent attention aggregation. Let ${}^{(0)}x_i^A$ and ${}^{(0)}x_i^B$ denote the $i$-th new descriptors of picture A and picture B obtained by key point aggregation; they are calculated by formula (2):

$${}^{(0)}x_i^A = d_i^A + \mathrm{MLP}(p_i^A), \qquad {}^{(0)}x_i^B = d_i^B + \mathrm{MLP}(p_i^B) \qquad (2)$$

where $\mathrm{MLP}(\cdot)$ denotes the output of the multi-layer perceptron, i.e. the dimension-raising result of the position information of each feature point of the picture.

In one possible implementation, the network structure of the multi-layer perceptron used for obtaining the new descriptors is shown in Table 1, where $N$ denotes the number of feature points ($N = N_A$ or $N_B$).

[Table 1 (image not recoverable from the extracted text): layer-by-layer structure of the multi-layer perceptron; the structure is described in the following paragraph.]
Namely, in the embodiment of the present invention, the network structure of the multilayer perceptron adopted in the key point aggregation is as follows. Let L denote the number of network layers of the multilayer perceptron; the first L-1 layers form a stacked structure of L-1 first convolution blocks, and the L-th layer is an addition layer, where a first convolution block comprises a convolution layer (Convolution1d), a batch normalization layer (BatchNorm1d) and an activation function ReLu layer connected in sequence. The input of the multilayer perceptron is all feature points of one picture with a channel number (dimension) of 3, i.e. the spatial position information and the confidence of the feature points; the output is all feature points with a channel number of M, where M equals the dimension of the descriptor. Within the stack of L-1 first convolution blocks, the number of output channels increases block by block over the first L-2 blocks until it reaches M, and the (L-1)-th block also outputs M channels. The addition layer adds the dimension-raised position information to the descriptor to obtain the new descriptor.
As a preferred structure, the number of convolution blocks is set to 5, and the numbers of output channels from the 1st block to the 5th block are 32, 64, 128, 256 and 256 respectively.
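For illustration only, a minimal PyTorch sketch of the key point aggregation described above follows, assuming descriptor dimension M = 256 and the channel schedule described above (32, 64, 128, 256, 256); the class and variable names are hypothetical, not taken from the patent.

```python
import torch
import torch.nn as nn

class KeypointAggregation(nn.Module):
    """Raises the 3-D position information (x, y, confidence) to the descriptor
    dimension M with a multi-layer perceptron (Conv1d + BatchNorm + ReLU blocks)
    and adds it to the descriptors, as in formula (2)."""
    def __init__(self, descriptor_dim: int = 256, channels=(32, 64, 128, 256)):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in list(channels) + [descriptor_dim]:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=1),
                       nn.BatchNorm1d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.mlp = nn.Sequential(*layers)

    def forward(self, descriptors: torch.Tensor, positions: torch.Tensor):
        # descriptors: (B, M, N), positions: (B, 3, N)
        return descriptors + self.mlp(positions)   # new descriptors (0)x_i

# Usage example with 100 feature points.
enc = KeypointAggregation()
d = torch.randn(1, 256, 100)
p = torch.randn(1, 3, 100)
print(enc(d, p).shape)   # torch.Size([1, 256, 100])
```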
The purpose of attention aggregation is to aggregate information more effectively. In the embodiment of the present invention, the overall network architecture adopts a graph network: each descriptor serves as a node that aggregates the information produced by the attention mechanism.
As a preferred structure, the graph network has 19 layers in total; among the first 18 layers, the odd layers are self layers and the even layers are cross layers. The 19th layer is a fully connected layer called the Final layer.
"Self" and "cross" refer to an introduced self-cross mechanism which, similar to how a person repeatedly compares two images, determines the object of the attention-based aggregation. Aggregating only the surrounding information within a picture is not enough; information from the opposite picture of the pair also needs to be aggregated. That is, the aggregation object of a self layer comes from the picture itself, the aggregation object of a cross layer comes from the opposite picture of the picture pair, and the two kinds of layers appear alternately.
That is, in the embodiment of the present invention, the number of layers of the graph network is defined as $L_G$; the first $L_G-1$ layers are alternating self and cross layers (so $L_G-1$ is an even number), and the last layer is a fully connected layer.
The calculation of each layer is expressed mathematically as follows.

Let ${}^{(l)}x_i^A$ denote the $i$-th descriptor of picture A at the $l$-th layer and ${}^{(l)}x_i^B$ the $i$-th descriptor of picture B at the $l$-th layer, where $l \in \{0, 1, \ldots, L_G-1\}$. The output of the key point aggregation is taken as the 0-th layer, i.e. ${}^{(0)}x_i^A$ and ${}^{(0)}x_i^B$ are the outputs after key point aggregation. Let ${}^{(l)}X^A$ and ${}^{(l)}X^B$ denote the sets of all descriptors of picture A and picture B at the $l$-th layer, and let ${}^{(l)}m_i^A$ and ${}^{(l)}m_i^B$ denote the aggregation information of the $i$-th descriptor of picture A and of picture B at the $l$-th layer.

The descriptor of the current layer is obtained by an update operation on the descriptor of the previous layer, as given in formula (3):

$${}^{(l)}x_i^A = {}^{(l-1)}x_i^A + g^{(l)}\!\left({}^{(l-1)}x_i^A \,\|\, {}^{(l)}m_i^A\right), \qquad {}^{(l)}x_i^B = {}^{(l-1)}x_i^B + g^{(l)}\!\left({}^{(l-1)}x_i^B \,\|\, {}^{(l)}m_i^B\right) \qquad (3)$$

where the superscript $(l-1)$ refers to the previous ($(l-1)$-th) layer, the symbol "$\|$" denotes concatenation in the channel direction, and $g^{(l)}(\cdot)$ denotes the output of the self layer or cross layer, i.e. the layer output of the graph network. The self layer and the cross layer have the same network structure but different inputs. The network structure consists of two stacked second convolution blocks: the first second convolution block has 2M input channels, 2M output channels and a convolution kernel size of 1 × 2M × 2M; the second one has 2M input channels, M output channels and a convolution kernel size of 1 × 2M × M. For a descriptor dimension of 256, the network structure parameters of the self layer or cross layer are shown in Table 2:

Table 2. Network structure of the self layer / cross layer (descriptor dimension M = 256):
    • 1st second convolution block (convolution layer + ReLu): input channels 512, output channels 512, convolution kernel size 1 × 512 × 512;
    • 2nd second convolution block (convolution layer + ReLu): input channels 512, output channels 256, convolution kernel size 1 × 512 × 256.

In the expression form k1 × k2 × k3 of the convolution kernel size, k1 × k2 indicates the convolution kernel shape and k3 indicates the number of output channels.
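As an illustration, a minimal PyTorch sketch of the per-layer update of formula (3) follows, using the two stacked second convolution blocks of Table 2 (2M → 2M → M, kernel size 1, each followed by a ReLu as described above); the names used are hypothetical.

```python
import torch
import torch.nn as nn

class MessageUpdate(nn.Module):
    """One self/cross layer of the graph network, formula (3):
    x <- x + g([x || m]), where g is two stacked convolution blocks
    (2M -> 2M -> M, kernel 1) and || is concatenation along the channel
    direction."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.g = nn.Sequential(
            nn.Conv1d(2 * dim, 2 * dim, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv1d(2 * dim, dim, kernel_size=1), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # x: previous-layer descriptors (B, M, N); m: aggregated messages (B, M, N)
        return x + self.g(torch.cat([x, m], dim=1))

# Usage: update 100 descriptors with their aggregated messages.
upd = MessageUpdate()
x = torch.randn(1, 256, 100)
m = torch.randn(1, 256, 100)
print(upd(x, m).shape)   # torch.Size([1, 256, 100])
```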
The aggregation information ${}^{(l)}m_i^A$ and ${}^{(l)}m_i^B$ is computed with an attention mechanism, and in the embodiment of the invention the self-cross mechanism controls the aggregation object.

The attention mechanism computes the aggregated information in a way similar to a database query. For picture A, the input of the $l$-th layer consists of two sets: the query set, which is the previous-layer descriptor set ${}^{(l-1)}X^A$ containing $N_A$ feature points, and a source set ${}^{(l)}S^A$ whose feature points are the objects to be aggregated. Likewise, for picture B the input of the $l$-th layer consists of the query set ${}^{(l-1)}X^B$, containing $N_B$ feature points, and a source set ${}^{(l)}S^B$.

For the $i$-th feature point of picture A, the aggregation information ${}^{(l)}m_i^A$ is calculated as follows. First, the query $q_i$, the index (key) $k_j$ and the value $v_j$ are calculated by formula (4):

$$q_i = W_1^{(l)} x_i + b_1^{(l)}, \qquad k_j = W_2^{(l)} s_j + b_2^{(l)}, \qquad v_j = W_3^{(l)} s_j + b_3^{(l)} \qquad (4)$$

where $x_i$ denotes the $i$-th descriptor in the query set ${}^{(l-1)}X^A$, $s_j$ denotes the $j$-th descriptor in the source set ${}^{(l)}S^A$, and $W_1^{(l)}, b_1^{(l)}, W_2^{(l)}, b_2^{(l)}, W_3^{(l)}, b_3^{(l)}$ are the parameters of the linear mappings of the $l$-th layer, i.e. the weights and biases. Similarly, the query, index and value of the attention mechanism of picture B are obtained from the query set ${}^{(l-1)}X^B$ and the source set ${}^{(l)}S^B$.

In a self layer: for picture A, the source set is the descriptor set of picture A itself, i.e. ${}^{(l)}S^A = {}^{(l-1)}X^A$; for picture B, ${}^{(l)}S^B = {}^{(l-1)}X^B$.

In a cross layer: for picture A, the source set comes from the opposite picture, i.e. ${}^{(l)}S^A = {}^{(l-1)}X^B$; for picture B, ${}^{(l)}S^B = {}^{(l-1)}X^A$.

The aggregation information ${}^{(l)}m_i^A$ is then equal to a weighted sum of the values $v_j$, with the weights given by an attention over the products of query and index; see formula (5):

$${}^{(l)}m_i^A = \sum_{j} \alpha_{ij}^A \, v_j, \qquad \alpha_{ij}^A = \mathrm{Softmax}_j\!\left(q_i^{\mathrm{T}} k_j\right) \qquad (5)$$

where $\alpha_{ij}^A$ and $\alpha_{ij}^B$ respectively denote, for picture A and picture B, the weight between the $i$-th element of the query of the $l$-th layer and the $j$-th element of the index, Softmax() represents the normalized exponential function, and the superscript "T" represents transposition. The aggregation information ${}^{(l)}m_i^B$ of picture B is computed in the same way.
Finally, after the aggregation information has been propagated through the first $L_G-1$ layers, the fully connected layer performs a linear mapping on the aggregated descriptors of picture A and picture B respectively to obtain the matching descriptors $f_i^A$ and $f_i^B$ (see formula (6)). The sets $F^A$ and $F^B$ of matching descriptors are then sent to the second stage for matching.

$$f_i^A = W_f \, {}^{(L_G-1)}x_i^A + b_f, \qquad f_i^B = W_f \, {}^{(L_G-1)}x_i^B + b_f \qquad (6)$$

where ${}^{(L_G-1)}x_i^A$ and ${}^{(L_G-1)}x_i^B$ denote the descriptors of the $(L_G-1)$-th layer (calculated according to formula (3)), the superscript A or B distinguishes the two pictures, the subscript $i$ distinguishes the feature points, and $W_f$ and $b_f$ are the parameters of the fully connected layer, i.e. its weight and bias.
The second-stage calculation module performs matching based on calculated matching scores. Two calculation modes are provided: inner product and Euclidean distance. After $F^A$ and $F^B$ are formed, define $F^A = \{f_i^A\}_{i=1}^{N_A}$ and $F^B = \{f_j^B\}_{j=1}^{N_B}$.

When the difference degree between the two pictures is greater than or equal to the specified difference threshold (i.e. the first stage is enabled), the inner product is selected to calculate the matching score, and a larger score indicates a better match. For each $f_i^A$ in $F^A$, a matching score is calculated with every descriptor in $F^B$ according to formula (7). For each $f_i^A$, if the largest calculated score is greater than the threshold threshold1, the corresponding descriptor is taken as the matching descriptor of $f_i^A$, giving the matching relation of the feature points. For example, if the matching score between $f_1^A$ and $f_3^B$ is the largest and is greater than threshold1, the 1st feature point of picture A matches the 3rd feature point of picture B.

$$\mathrm{score}(i,j) = \left\langle f_i^A, f_j^B \right\rangle = (f_i^A)^{\mathrm{T}} f_j^B \qquad (7)$$

Preferably, the value range of the threshold threshold1 can be set to 9-11.

When the difference degree between the two pictures is smaller than the specified difference threshold (i.e. the first stage is disabled), the Euclidean distance is selected to calculate the matching score, and a smaller score indicates a better match. Since no aggregation through the first-stage network has taken place, the calculation is performed directly on the input descriptor sets $D^A$ and $D^B$. For each $d_i^A$ in $D^A$, a matching score is calculated with every descriptor in $D^B$ according to formula (8). For each $d_i^A$, if the smallest calculated score is less than the threshold threshold2, the corresponding descriptor is taken as the matching descriptor of $d_i^A$, giving the matching relation of the feature points. For example, if the matching score between $d_2^A$ and $d_3^B$ is the smallest and is less than threshold2, the 2nd feature point of picture A matches the 3rd feature point of picture B.

$$\mathrm{score}(i,j) = \left\| d_i^A - d_j^B \right\|_2 \qquad (8)$$
Preferably, the value range of the threshold value threshold2 can be set to 0.8-1.0.
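For illustration only, a NumPy sketch of the two second-stage matching modes of formulas (7) and (8) follows; the threshold values shown are taken from the preferred ranges above, and all names are hypothetical.

```python
import numpy as np

def match_inner_product(f_a: np.ndarray, f_b: np.ndarray, threshold1: float = 10.0):
    """Formula (7): score(i, j) = <f_i^A, f_j^B>. For each feature point of
    picture A, keep the picture-B point with the largest score if that score
    reaches threshold1 (preferred range 9-11)."""
    scores = f_a @ f_b.T                          # (N_A, N_B) inner products
    best = scores.argmax(axis=1)
    return [(i, int(j)) for i, j in enumerate(best) if scores[i, j] >= threshold1]

def match_euclidean(d_a: np.ndarray, d_b: np.ndarray, threshold2: float = 0.9):
    """Formula (8): score(i, j) = ||d_i^A - d_j^B||_2. For each feature point of
    picture A, keep the picture-B point with the smallest distance if that
    distance does not exceed threshold2 (preferred range 0.8-1.0)."""
    dists = np.linalg.norm(d_a[:, None, :] - d_b[None, :, :], axis=2)
    best = dists.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(best) if dists[i, j] <= threshold2]

# Usage with random descriptors (dimension 256).
rng = np.random.default_rng(0)
fa, fb = rng.standard_normal((100, 256)), rng.standard_normal((120, 256))
print(len(match_inner_product(fa, fb)), len(match_euclidean(fa, fb)))
```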
The embodiment of the invention provides a self-adaptive two-stage feature point matching method based on matching difficulty for geometry-oriented tasks in computer vision. It can achieve high matching accuracy in different environments, can adaptively adjust the network architecture for different matching difficulties so as to use computing resources efficiently, and is obviously faster than the traditional approach of simply using a neural network.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the inventive concept, and such changes and modifications remain within the spirit and scope of the invention.

Claims (7)

1. A self-adaptive two-stage feature point matching method based on matching difficulty is characterized by comprising the following steps:
step 1, inputting a picture pair to be matched, wherein the input picture information comprises brightness information of a picture and characteristic point information of the picture, the characteristic point information comprises position information and a descriptor, and the position information comprises a spatial position coordinate and a confidence value of the characteristic point;
step 2, calculating the picture difference degree between the picture pairs to be matched based on the brightness information of the pictures, if the picture difference degree is greater than or equal to a difference threshold value, executing the steps 3 to 5, otherwise executing the step 6;
the picture difference degree is as follows: carrying out size normalization processing on the two pictures of the picture pair, and calculating the picture difference degree according to the sum of absolute errors of the brightness information;
step 3, performing dimension increasing processing on the position information of the feature points of each picture to enable the dimension of the position information of the feature points after dimension increasing to be consistent with the dimension of the descriptor, and adding the descriptor and the position information after dimension increasing to obtain a new descriptor of each feature point;
step 4, carrying out attention aggregation processing on the new descriptor of each feature point to obtain a matching descriptor of each feature point;
step 5, calculating a matching result in an inner product mode:
respectively defining two pictures of the picture pair as a first picture and a second picture;
traversing each feature point of the first picture, calculating the matching degree between each current feature point of the first picture and the matching descriptor of each feature point in the second picture by adopting an inner product mode, and if the maximum matching degree is larger than or equal to a first matching threshold value, taking the feature point of the second picture corresponding to the maximum matching degree as the matching result of the current feature point of the first picture;
step 6, calculating a matching result in an Euclidean distance mode:
respectively defining two pictures of the picture pair as a first picture and a second picture;
and traversing each feature point of the first picture, calculating the matching degree between each current feature point of the first picture and the descriptor of each feature point in the second picture by adopting an Euclidean distance mode, and if the minimum matching degree is less than or equal to a second matching threshold, taking the feature point of the second picture corresponding to the minimum matching degree as the matching result of the current feature point of the first picture.
2. The adaptive two-stage feature point matching method based on matching difficulty degree according to claim 1, wherein in step 3, the way of performing dimension-increasing processing on the position information of the feature points of each picture is as follows: and performing dimension-increasing processing on the position information of the feature points of each picture through a multilayer perceptron.
3. The adaptive two-stage feature point matching method based on matching difficulty degree according to claim 2, wherein in step 3, the multi-layer perceptron is:
defining L to represent the number of network layers of the multilayer perceptron, wherein the first L-1 layers are a stacked structure of L-1 first convolution blocks, and the L-th layer is an addition layer; the input of the 1st first convolution block is the position information of all feature points of a picture, the number of channels of the feature map output by the (L-1)-th first convolution block is the same as the dimension of the descriptor, the input of the addition layer is the descriptors of all feature points of the picture and the feature map output by the (L-1)-th first convolution block, and the first convolution block comprises a convolution layer, a batch normalization layer and an activation function ReLu layer which are sequentially connected.
4. The adaptive two-stage feature point matching method based on matching difficulty degree according to any one of claims 1 to 3, wherein in the step 4, the attention aggregation processing on the new descriptor of each feature point is as follows:
performing attention aggregation processing on the new descriptors of the feature points by using a graph network with L_G layers, wherein L_G is an odd number greater than 1;
the first L_G-1 layers of the graph network form an alternating layer structure of self layers and cross layers, and the last layer is a fully connected layer;
the self layer and the cross layer have the same network structure and are both neural network layers with attention mechanisms, the input of the self layer is different feature points of the same picture of the picture pair to be matched, and the input of the cross layer is different feature points of the two pictures of the picture pair to be matched.
5. The adaptive two-stage feature point matching method based on matching difficulty as claimed in claim 4, wherein in step 4, the network structure of each of the first L_G-1 layers of the graph network is two stacked second convolution blocks; along the forward propagation direction, the 1st second convolution block of each layer has 2M input channels and 2M output channels, with a convolution kernel size of 1 × 2M × 2M; the 2nd second convolution block of each layer has 2M input channels and M output channels, with a convolution kernel size of 1 × 2M × M, wherein M represents the dimension of a descriptor, and the second convolution block comprises a convolution layer and an activation function ReLu layer which are sequentially connected.
6. The adaptive two-stage feature point matching method based on matching difficulty degree according to claim 1, wherein in step 5, the value range of the first matching threshold is set to be 9-11.
7. The adaptive two-stage feature point matching method based on matching difficulty degree according to claim 1, wherein in step 6, the value range of the second matching threshold is set to be 0.8-1.
CN202110884790.4A 2021-08-03 2021-08-03 Self-adaptive two-stage feature point matching method based on matching difficulty Active CN113326856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110884790.4A CN113326856B (en) 2021-08-03 2021-08-03 Self-adaptive two-stage feature point matching method based on matching difficulty

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110884790.4A CN113326856B (en) 2021-08-03 2021-08-03 Self-adaptive two-stage feature point matching method based on matching difficulty

Publications (2)

Publication Number Publication Date
CN113326856A CN113326856A (en) 2021-08-31
CN113326856B true CN113326856B (en) 2021-12-03

Family

ID=77426909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110884790.4A Active CN113326856B (en) 2021-08-03 2021-08-03 Self-adaptive two-stage feature point matching method based on matching difficulty

Country Status (1)

Country Link
CN (1) CN113326856B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612531B (en) * 2022-02-22 2024-07-16 腾讯科技(深圳)有限公司 Image processing method and device, electronic equipment and storage medium
CN117765084B (en) * 2024-02-21 2024-05-03 电子科技大学 Visual positioning method for iterative solution based on dynamic branch prediction

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019084804A1 (en) * 2017-10-31 2019-05-09 深圳市大疆创新科技有限公司 Visual odometry and implementation method therefor
CN113159043A (en) * 2021-04-01 2021-07-23 北京大学 Feature point matching method and system based on semantic information

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251926B (en) * 2008-03-20 2011-08-17 北京航空航天大学 Remote sensing image registration method based on local configuration covariance matrix
CN101547359B (en) * 2009-04-17 2011-01-05 西安交通大学 Rapid motion estimation self-adaptive selection method based on motion complexity
CN102322864B (en) * 2011-07-29 2014-01-01 北京航空航天大学 Airborne optic robust scene matching navigation and positioning method
CN102592129B (en) * 2012-01-02 2013-10-16 西安电子科技大学 Scenario-driven image characteristic point selection method for smart phone
TWI486906B (en) * 2012-12-14 2015-06-01 Univ Nat Central Using Image Classification to Strengthen Image Matching
CN106358029B (en) * 2016-10-18 2019-05-03 北京字节跳动科技有限公司 A kind of method of video image processing and device
CN109934857B (en) * 2019-03-04 2021-03-19 大连理工大学 Loop detection method based on convolutional neural network and ORB characteristics
CN110246169B (en) * 2019-05-30 2021-03-26 华中科技大学 Gradient-based window adaptive stereo matching method and system
CN111814839B (en) * 2020-06-17 2023-09-01 合肥工业大学 Template matching method of longicorn group optimization algorithm based on self-adaptive variation
CN111767960A (en) * 2020-07-02 2020-10-13 中国矿业大学 Image matching method and system applied to image three-dimensional reconstruction
CN112734747B (en) * 2021-01-21 2024-06-25 腾讯科技(深圳)有限公司 Target detection method and device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019084804A1 (en) * 2017-10-31 2019-05-09 深圳市大疆创新科技有限公司 Visual odometry and implementation method therefor
CN113159043A (en) * 2021-04-01 2021-07-23 北京大学 Feature point matching method and system based on semantic information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Structure adaptive feature point matching for urban area wide-based line images with viewpoint variation";Chen M等;《Acta Geodaetica et Cartographica Sinica》;20191231;第1129-1140页 *
"基于环境差异度的自适应角点匹配算法";刘芳萍等;《数字视频》;20151231;第39卷(第1期);第24-31页 *

Also Published As

Publication number Publication date
CN113326856A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN113326856B (en) Self-adaptive two-stage feature point matching method based on matching difficulty
CN107358626B (en) Method for generating confrontation network calculation parallax by using conditions
CN110220493B (en) Binocular distance measuring method and device
CN112435282B (en) Real-time binocular stereo matching method based on self-adaptive candidate parallax prediction network
He et al. A fully end-to-end cascaded cnn for facial landmark detection
WO2024021394A1 (en) Person re-identification method and apparatus for fusing global features with ladder-shaped local features
CN112258554A (en) Double-current hierarchical twin network target tracking method based on attention mechanism
CN111652899A (en) Video target segmentation method of space-time component diagram
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN113222998B (en) Semi-supervised image semantic segmentation method and device based on self-supervised low-rank network
Ma et al. Flexible and generalized real photograph denoising exploiting dual meta attention
CN116188825A (en) Efficient feature matching method based on parallel attention mechanism
CN115641285A (en) Binocular vision stereo matching method based on dense multi-scale information fusion
CN115564983A (en) Target detection method and device, electronic equipment, storage medium and application thereof
Xie et al. Feature-guided spatial attention upsampling for real-time stereo matching network
WO2022197615A1 (en) Techniques for adaptive generation and visualization of quantized neural networks
CN114066844A (en) Pneumonia X-ray image analysis model and method based on attention superposition and feature fusion
CN109063834B (en) Neural network pruning method based on convolution characteristic response graph
Li et al. U-Match: Two-view Correspondence Learning with Hierarchy-aware Local Context Aggregation.
CN114005046A (en) Remote sensing scene classification method based on Gabor filter and covariance pooling
CN116824143A (en) Point cloud segmentation method based on bilateral feature fusion and vector self-attention
CN112464900A (en) Multi-template visual target tracking method based on twin network
CN116824232A (en) Data filling type deep neural network image classification model countermeasure training method
CN112925822B (en) Time series classification method, system, medium and device based on multi-representation learning
CN115272404A (en) Multi-target tracking method based on nuclear space and implicit space feature alignment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant