CN109934272B - Image matching method based on full convolution network


Info

Publication number
CN109934272B
Authority
CN
China
Prior art keywords
image
convolution
layer
output
matching
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910154179.9A
Other languages
Chinese (zh)
Other versions
CN109934272A (en)
Inventor
桑勇
李庆
赵健龙
段富海
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN201910154179.9A priority Critical patent/CN109934272B/en
Publication of CN109934272A publication Critical patent/CN109934272A/en
Application granted granted Critical
Publication of CN109934272B publication Critical patent/CN109934272B/en


Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an image matching method based on a full convolution network, belonging to the field of computer vision. The method comprises the following specific steps: first, a full convolution image matching network model for image matching is constructed and the model initialization parameters and loss function are set, the whole full convolution matching network model consisting of eight parts; second, image data for training the model are obtained according to the full convolution image matching network model; third, the full convolution image matching network model is trained with the image data on a GPU device; fourth, images are matched with the trained full convolution image matching network model. The method offers good matching accuracy and efficiency, solves the problem that matching accuracy and matching efficiency could not previously be achieved simultaneously, and compares favorably with traditional methods.

Description

Image matching method based on full convolution network
Technical Field
The invention belongs to the field of computer vision and relates to an image matching algorithm based on a full convolution network.
Background
Feature matching is a basic task in computer vision and plays an important role in it; it is mainly applied in structure from motion (SfM), baseline registration in binocular vision, panorama construction, object recognition, and other high-level vision tasks.
Before data-driven feature learning approaches became popular, research in this field focused on the design of local descriptors and on descriptor matching algorithms. Lowe designed a feature point with good robustness, the SIFT feature point [Lowe D G. Distinctive image features from scale-invariant keypoints [J]. International Journal of Computer Vision, 2004, 60(2): 91-110.]. Various hand-crafted feature points followed, and SIFT, ORB, SURF, and the like remain in use today. SIFT has the best robustness but low algorithmic efficiency and cannot satisfy application scenarios with real-time requirements; SURF is an approximation of SIFT that improves on its efficiency, but its robustness is slightly weaker; ORB feature points are efficient to compute but less robust, so feature matching with them produces more mismatches. Descriptor matching is the process of measuring the similarity of the features to be matched with a distance function (Hamming distance, Euclidean distance, and the like) and selecting suitable matching points. Feature-point matching uses brute-force search; when the number of feature points is large, brute-force search is computationally expensive, so fast approximate nearest neighbor (ANN) algorithms are adopted to accelerate feature matching.
Artificially designed feature descriptors have limitations and cannot simultaneously guarantee robustness and algorithmic efficiency. In recent years, convolutional networks have achieved excellent performance in high-level tasks such as object detection, image classification, and image segmentation, and high-performance parallel computing hardware such as GPUs (graphics processing units) and FPGAs (field-programmable gate arrays), together with its supporting software, has matured. To solve the problem that image matching accuracy and matching efficiency cannot be achieved at the same time, the invention proposes a full convolution network matching method that improves both feature matching accuracy and matching speed.
Disclosure of Invention
In order to solve the problem that the matching accuracy and matching efficiency of traditional feature matching algorithms cannot coexist, the invention provides an image matching method based on a full convolution network that improves both the matching accuracy and the matching efficiency of images.
In order to achieve the above object, the invention adopts the following technical solution:
An image matching method based on a full convolution network comprises the following steps:
Step one, construct a full convolution image matching network model for image matching and set the model initialization parameters and loss function; the whole full convolution matching network model consists of eight parts. This step comprises the following substeps:
1) The first part consists of a convolutional layer and a pooling layer. The input is an image of size 64x64 formed by stacking two image blocks into 2 channels. The convolutional layer parameter is 3x3x2 with 64 kernels; the convolved output is processed by a ReLU activation function, producing a feature map of size 64x64x64. The feature map is then processed by a max pooling layer with stride 2, yielding a feature map of size 32x32x64.
2) The second part consists of a convolutional layer and a pooling layer. The feature map from substep 1) is the input; the convolutional layer parameter is 3x3x64x128, and the convolved output is processed by a ReLU activation function, producing a feature map of size 32x32x128. The feature map then passes through a max pooling layer with stride 2, yielding a feature map of size 16x16x128.
3) The third part comprises only a convolutional layer, with no pooling layer. The feature map from substep 2) is the input; the convolutional layer parameter is 3x3x128x128, and the convolved output is processed by a ReLU activation function, producing a feature map of size 16x16x128.
4) The fourth part comprises only a convolutional layer, with no pooling layer. The feature map from substep 3) is the input; the convolutional layer parameter is 3x3x128x64, and the convolved output is processed by a ReLU activation function, producing a feature map of size 16x16x64.
5) The fifth part consists of a convolutional layer and a pooling layer. The feature map from substep 4) is the input; the convolutional layer parameter is 3x3x64x64, and the convolved output is processed by a ReLU activation function and then by a max pooling layer with stride 2, yielding a feature map of size 8x8x64.
6) The sixth part comprises only a convolutional layer, with no pooling layer. The feature map from substep 5) is the input; the convolutional layer parameter is 3x3x64x64, and the convolved output is processed by a ReLU activation function, producing a feature map of size 8x8x64.
7) The seventh part is a flattening layer, which flattens the feature map of size 8x8x64 into a vector of dimension 4096x1.
8) The eighth part is an output layer composed of a convolutional layer with parameter 1x1x4096x2 and a softmax activation function; it is the output layer of the whole full convolution matching network and outputs the probability that the images match.
The convolutional layer parameters in substeps 1) to 6) and 8) are initialized with the Xavier method, and model training takes minimization of the cross-entropy loss function E as the optimization objective, defined as follows:

E = -\sum_i y_i \log \hat{y}_i

wherein

\hat{y}_i = \frac{e^{x_i}}{\sum_j e^{x_j}}

x_i is the i-th value of the model output vector, N is the number of dimensions of the model output vector, i, j \in [0, N], e is the natural constant, \hat{y}_i is the probability value of the class corresponding to the i-th output, and y_i is the label value of the i-th dimension of the output vector.
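For concreteness, the architecture and loss above can be expressed as a short model sketch. The following minimal PyTorch sketch is an illustration, not the patent's implementation: the patent does not state convolution padding or pooling window sizes, so 3x3 convolutions with padding 1 and 2x2 max pooling are assumed here to reproduce the stated feature-map sizes, and all names are illustrative.

```python
# Minimal sketch of the eight-part full convolution matching network.
# Assumptions: padding=1 for the 3x3 convolutions and 2x2 max pooling
# with stride 2, chosen to reproduce the feature-map sizes given above.
import torch
import torch.nn as nn

class FullConvMatchNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(),     # part 1: 64x64x2 -> 64x64x64
            nn.MaxPool2d(2),                               #         -> 32x32x64
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),   # part 2: -> 32x32x128
            nn.MaxPool2d(2),                               #         -> 16x16x128
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),  # part 3: -> 16x16x128
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),   # part 4: -> 16x16x64
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),    # part 5: -> 16x16x64
            nn.MaxPool2d(2),                               #         -> 8x8x64
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),    # part 6: -> 8x8x64
        )
        # part 8: 1x1 convolution applied to the flattened 4096-dim vector
        self.classifier = nn.Conv2d(4096, 2, kernel_size=1)
        for m in self.modules():                  # Xavier initialization of the conv layers
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_uniform_(m.weight)

    def forward(self, x):                         # x: (batch, 2, 64, 64)
        x = self.features(x)
        x = x.reshape(x.size(0), 4096, 1, 1)      # part 7: flatten 8x8x64 -> 4096x1
        logits = self.classifier(x).flatten(1)    # part 8: -> (batch, 2)
        return torch.softmax(logits, dim=1)       # matching probability
```

Taking the negative log of the output probability at the label's index reproduces the loss E above; in PyTorch the same quantity can be computed directly from the pre-softmax logits with `nn.CrossEntropyLoss`.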
Step two, obtain the image data for training the model according to the full convolution image matching network model constructed in step one; this comprises the following substeps:
1) Feature points such as SIFT are detected in the image, and image blocks of size 64x64 centered on the pixel positions of the SIFT feature points are taken as training data. To obtain matching information between image blocks from different images, the features must first be pre-matched with a traditional image matching method to obtain correct labels; this process requires manual verification to ensure label correctness.
2) After the training data and corresponding labels are obtained, the training data are augmented to strengthen the robustness of the algorithm and produce the final image data. The augmentation specifically consists of: swapping the channel order of the two images, rotating one image by 90 and 180 degrees, and adding random noise to the images.
Step three, on a GPU device, train the full convolution image matching network model of step one with the image data of step two to obtain a trained model.
Step four, match image block pairs in the images to be matched with the trained model; this comprises the following substeps:
1) detect the SIFT feature points of the images to be matched, and crop image blocks of size 64x64 centered on the feature points;
2) select image blocks of size 64x64 from the two images to be matched, stack them two by two to obtain the image block pairs to be predicted, and input these pairs into the trained full convolution image matching network to obtain the matching result of the two images. If the first image yields n image blocks in total and the second image yields m, the number of stacked images to be predicted is n x m.
The invention has the beneficial effects that: the method effectively solves the problem that the matching accuracy and the matching efficiency cannot be achieved simultaneously, and has more advantages compared with the traditional method.
Drawings
FIG. 1 is a flow chart of the algorithm of the present invention;
FIG. 2 is a diagram of a full convolution matching network in accordance with the present invention;
table 1 is a table of parameters for a full convolution matching network in the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
the image matching method based on the full convolution network comprises the following steps:
the method comprises the following steps: designing a full convolution image matching network, and setting model initialization parameters and a loss function;
constructing a full convolution image matching network model, wherein the model comprises 8 parts: the first part comprises a convolutional layer and a pooling layer, an image formed by stacking 2-channel image blocks is input, the size of the image is 64x64, the parameter of the convolutional layer is 3x3x2, the number of convolutions is 64, the convolved data is processed by a RELU activation function, the size of an output characteristic diagram is 64x64x64, and then the characteristic diagram is processed by a maximum pooling layer with the step size parameter of 2 to obtain a characteristic diagram of 32x32x 64;
the structure of the second part of the full convolution matching network model is similar to that of the first part, the convolution parameter is 3x3x64x128, convolution is carried out, then the RELU activation function is carried out to process convolution output, the output feature map is 32x32x128, and then the feature map is subjected to the maximum pooling layer with the step length of 2 to obtain a feature map with the size of 16x16x 128;
the structure of the third part of the full convolution matching network model is slightly different from the structures of the first part and the second part, the part does not have a pooling layer, the parameter of the convolution layer of the part is 3x3x128x128, the activation function is a RELU activation function, and the size of the feature map output after the third part of processing is 16x16x 128;
the fourth part of the full convolution matching network has the structure that the convolution parameter is 3x3x128x64, the activation function is the RELU activation function, and the size of the output feature map is 16x16x 64;
the fifth part of the structure of the full convolution matching network is that the convolution parameter is 3x3x64x64, the activation function is a RELU activation function, the size of the output feature map is 8x8x64 after the maximum pooling layer processing with the step length of 2;
the sixth part of the structure of the full convolution matching network is that the convolution parameter is 3x3x64x64, the activation function is the RELU activation function, and the size of the output feature map is 8x8x 64;
the seventh partial structure of the full convolution matching network is a flattening layer, and the function of the layer is to smooth the feature map with the size of 8x8x64 into a vector with the dimension of 4096x 1;
the eighth part of the full convolution matching network is an output layer consisting of a convolution layer with the parameter of 1x1x4096x2 and a softmax activation function, and the eighth part is the output layer of the whole full convolution matching network and outputs the probability of image matching;
2) Initialize each convolutional layer parameter in the network with the Xavier initialization method, and train the model with minimization of the cross-entropy loss function as the optimization objective; the cross-entropy loss function is defined as follows:
E = -\sum_i y_i \log \hat{y}_i

where

\hat{y}_i = \frac{e^{x_i}}{\sum_j e^{x_j}}

x_i denotes the i-th value of the model output vector and N is the number of dimensions of the model output vector.
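As a worked illustration (with numbers not taken from the patent): if the model outputs x = (2.0, 0.5) for a pair whose label is y = (1, 0), then \hat{y}_1 = e^{2.0}/(e^{2.0} + e^{0.5}) \approx 7.389/9.038 \approx 0.818, so E = -\ln(0.818) \approx 0.201. A confident correct prediction drives E toward zero, while a confident wrong one makes it large.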
Step two: prepare the image data for training the model according to the designed full convolution network model.
1) Detect feature points such as SIFT and, centered on the pixel position of each SIFT feature point, take image blocks of size 64x64 as the training set. To obtain matching information between image blocks from different images, the features are first pre-matched with a traditional image matching method and then manually verified to ensure label correctness. Each pair of matched image blocks is stacked as one 2-channel image and the images are numbered; the label bearing the same number as the image is 1 if the pair matches and 0 otherwise.
2) After the training data and corresponding labels are obtained, the data are lightly augmented to strengthen the robustness of the algorithm. The augmentation mainly consists of: swapping the channel order of the two images, rotating one image by 90 and 180 degrees, and adding random noise to the images.
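A minimal sketch of this augmentation step follows, assuming each sample is a NumPy array of shape (64, 64, 2) holding the stacked pair; the noise level and function names are illustrative, not specified by the patent.

```python
# Sketch of the three augmentations: channel swap, rotation of one
# block by 90/180 degrees, and additive random noise (std. dev. assumed).
import numpy as np

def augment_pair(pair: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    out = pair.astype(np.float32).copy()
    if rng.random() < 0.5:
        out = out[:, :, ::-1].copy()              # swap the channel order of the two blocks
    k = int(rng.choice([0, 1, 2]))                # 0: no rotation, 1: 90 degrees, 2: 180 degrees
    if k:
        out[:, :, 0] = np.rot90(out[:, :, 0], k=k).copy()  # rotate one image only
    out += rng.normal(0.0, 2.0, size=out.shape)   # add random noise
    return out

# usage example
rng = np.random.default_rng(0)
sample = rng.random((64, 64, 2), dtype=np.float32)
augmented = augment_pair(sample, rng)
```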
Step three: on a GPU device, train the full convolution matching network model with the prepared image data.
1) Train the model with a stochastic gradient descent optimization algorithm. The initial learning rate is set to 0.1 and is reduced by a factor of 10 every 10,000 iterations; the maximum number of iterations is 30,000, and 128 images are input per iteration.
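This schedule maps directly onto a standard PyTorch training loop. The following sketch assumes the FullConvMatchNet sketch from step one and a `train_loader` yielding batches of 128 stacked 2-channel blocks with 0/1 integer labels (both assumptions).

```python
# Sketch of the training schedule: SGD, lr 0.1 divided by 10 every
# 10,000 iterations, at most 30,000 iterations, batch size 128.
import torch
import torch.nn.functional as F

model = FullConvMatchNet().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)

iteration = 0
while iteration < 30000:
    for pairs, labels in train_loader:            # 128 image pairs per iteration
        pairs, labels = pairs.cuda(), labels.cuda()
        probs = model(pairs)                      # softmax output of the network
        # cross-entropy E on the softmax output, as in the patent's formulation
        loss = F.nll_loss(torch.log(probs.clamp_min(1e-12)), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                          # decays lr every 10,000 iterations
        iteration += 1
        if iteration == 30000:
            break
```

In practice, applying `nn.CrossEntropyLoss` to the pre-softmax logits is numerically preferable; the form above simply mirrors the softmax-then-cross-entropy description in the text.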
Step four: match image block pairs in the images to be matched with the trained model; this comprises the following substeps:
1) detect the SIFT feature points of the images to be matched, and crop image blocks of size 64x64 centered on the feature points;
2) select image blocks of size 64x64 from the two images to be matched, stack them two by two to obtain the image block pairs to be predicted, and input these pairs into the trained full convolution image matching network to obtain the matching result of the two images. For example, if the first image yields n image blocks in total and the second image yields m, the number of stacked images to be predicted is n x m.
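A minimal sketch of this matching procedure using OpenCV's SIFT detector is given below; patch-boundary handling, file names, and the 0.5 decision threshold are assumptions, and `model` is the trained network from the earlier sketches.

```python
# Sketch: detect SIFT keypoints, crop 64x64 blocks, stack all n*m
# cross-image pairs into 2-channel inputs, and score them with the model.
import cv2
import numpy as np
import torch

def extract_blocks(gray: np.ndarray, size: int = 64) -> np.ndarray:
    sift = cv2.SIFT_create()
    half, blocks = size // 2, []
    for kp in sift.detect(gray, None):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if half <= x < gray.shape[1] - half and half <= y < gray.shape[0] - half:
            blocks.append(gray[y - half:y + half, x - half:x + half])
    return np.asarray(blocks, dtype=np.float32)

img1 = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)   # illustrative file names
img2 = cv2.imread("image2.png", cv2.IMREAD_GRAYSCALE)
b1, b2 = extract_blocks(img1), extract_blocks(img2)     # n and m blocks

pairs = np.stack([np.stack([a, b]) for a in b1 for b in b2])  # (n*m, 2, 64, 64)
with torch.no_grad():
    probs = model(torch.from_numpy(pairs))  # move tensors to the model's device as needed
matches = probs[:, 1] > 0.5                 # assuming index 1 is the "matching" class
```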
TABLE 1. Parameters of the full convolution matching network

Part  Layers                      Convolution parameter  Output size
1     convolution + max pooling   3x3x2x64               32x32x64
2     convolution + max pooling   3x3x64x128             16x16x128
3     convolution                 3x3x128x128            16x16x128
4     convolution                 3x3x128x64             16x16x64
5     convolution + max pooling   3x3x64x64              8x8x64
6     convolution                 3x3x64x64              8x8x64
7     flattening                  -                      4096x1
8     1x1 convolution + softmax   1x1x4096x2             2x1
The above embodiments merely illustrate implementations of the present invention and should not be construed as limiting the scope of the patent. It should be noted that those skilled in the art can make many variations and modifications without departing from the concept of the present invention, and these all fall within the protection scope of the present invention.

Claims (2)

1. An image matching method based on a full convolution network, characterized by comprising the following steps:
step one, constructing a full convolution image matching network model for image matching and setting model initialization parameters and a loss function, wherein the whole full convolution matching network model consists of eight parts, comprising the following substeps:
1) the first part consists of a convolutional layer and a pooling layer; the input is an image of size 64x64 formed by stacking two image blocks into 2 channels; the convolutional layer parameter is 3x3x2 with 64 kernels; the convolved output is processed by a ReLU activation function, and a feature map of size 64x64x64 is output; the feature map is processed by a max pooling layer with stride 2, and a feature map of size 32x32x64 is output;
2) the second part consists of a convolutional layer and a pooling layer; the feature map obtained in substep 1) is the input of the second part's convolutional layer; the convolutional layer parameter is 3x3x64x128; the convolved output is processed by a ReLU activation function, and a feature map of size 32x32x128 is output; the feature map passes through a max pooling layer with stride 2, and a feature map of size 16x16x128 is output;
3) the third part comprises only a convolutional layer, with no pooling layer; the feature map obtained in substep 2) is the input of the third part's convolutional layer; the convolutional layer parameter is 3x3x128x128; the convolved output is processed by a ReLU activation function, and a feature map of size 16x16x128 is output;
4) the fourth part comprises only a convolutional layer, with no pooling layer; the feature map obtained in substep 3) is the input of the fourth part's convolutional layer; the convolutional layer parameter is 3x3x128x64; the convolved output is processed by a ReLU activation function, and a feature map of size 16x16x64 is output;
5) the fifth part consists of a convolutional layer and a pooling layer; the feature map obtained in substep 4) is the input of the fifth part's convolutional layer; the convolutional layer parameter is 3x3x64x64; the convolved output is processed by a ReLU activation function and then by a max pooling layer with stride 2, and a feature map of size 8x8x64 is output;
6) the sixth part comprises only a convolutional layer, with no pooling layer; the feature map obtained in substep 5) is the input of the sixth part's convolutional layer; the convolutional layer parameter is 3x3x64x64; the convolved output is processed by a ReLU activation function, and a feature map of size 8x8x64 is output;
7) the seventh part is a flattening layer, which flattens the feature map of size 8x8x64 into a vector of dimension 4096x1;
8) the eighth part is an output layer composed of a convolutional layer with parameter 1x1x4096x2 and a softmax activation function; it is the output layer of the whole full convolution matching network and outputs the probability that the images match;
step two, obtaining image data for training the model according to the full convolution image matching network model constructed in step one, comprising the following substeps:
1) detecting the SIFT feature points of the image and, centered on the pixel positions of the SIFT feature points, taking image blocks of size 64x64 as training data; to obtain matching information between image blocks from different images, the features need to be pre-matched in advance with a traditional image matching method to obtain correct labels; this process requires manual verification to ensure label correctness;
2) after the training data and corresponding labels are obtained, augmenting the training data to strengthen the robustness of the algorithm and obtain the final image data, the augmentation specifically consisting of: swapping the channel order of the two images, rotating one image by 90 and 180 degrees, and adding random noise to the images;
step three, on a GPU device, training the full convolution image matching network model of step one with the image data of step two to obtain a trained model;
step four, matching image block pairs in the images to be matched with the trained model, comprising the following substeps:
1) detecting the SIFT feature points of the images to be matched and cropping image blocks of size 64x64 centered on the feature points;
2) selecting image blocks of size 64x64 from the two images to be matched, stacking them two by two to obtain image block pairs to be predicted, and inputting these pairs into the trained full convolution image matching network to obtain the matching result of the two images to be matched.
2. The image matching method based on a full convolution network according to claim 1, wherein the convolutional layer parameters in substeps 1) to 6) and 8) of step one are initialized with the Xavier method, and model training takes minimization of the cross-entropy loss function E as the optimization objective, defined as follows:

E = -\sum_i y_i \log \hat{y}_i

wherein

\hat{y}_i = \frac{e^{x_i}}{\sum_j e^{x_j}}

x_i is the i-th value of the model output vector, N is the number of dimensions of the model output vector, i, j \in [0, N], e is the natural constant, \hat{y}_i is the probability value of the class corresponding to the i-th output, and y_i is the label value of the i-th dimension of the output vector.
CN201910154179.9A 2019-03-01 2019-03-01 Image matching method based on full convolution network Expired - Fee Related CN109934272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910154179.9A CN109934272B (en) 2019-03-01 2019-03-01 Image matching method based on full convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910154179.9A CN109934272B (en) 2019-03-01 2019-03-01 Image matching method based on full convolution network

Publications (2)

Publication Number Publication Date
CN109934272A CN109934272A (en) 2019-06-25
CN109934272B (en) 2022-03-29

Family

ID=66986239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910154179.9A Expired - Fee Related CN109934272B (en) 2019-03-01 2019-03-01 Image matching method based on full convolution network

Country Status (1)

Country Link
CN (1) CN109934272B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111199558A (en) * 2019-12-25 2020-05-26 北京自行者科技有限公司 Image matching method based on deep learning
CN111724424B (en) * 2020-06-24 2024-05-14 上海应用技术大学 Image registration method
CN111812732B (en) * 2020-06-29 2024-03-15 中铁二院工程集团有限责任公司 Geoelectromagnetic nonlinear inversion method based on convolutional neural network
CN111951319A (en) * 2020-08-21 2020-11-17 清华大学深圳国际研究生院 Image stereo matching method
CN113128518B (en) * 2021-03-30 2023-04-07 西安理工大学 Sift mismatch detection method based on twin convolution network and feature mixing


Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10204299B2 (en) * 2015-11-04 2019-02-12 Nec Corporation Unsupervised matching in fine-grained datasets for single-view object reconstruction

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN108009524A (en) * 2017-12-25 2018-05-08 西北工业大学 A kind of method for detecting lane lines based on full convolutional network
CN108960258A (en) * 2018-07-06 2018-12-07 江苏迪伦智能科技有限公司 A kind of template matching method based on self study depth characteristic

Non-Patent Citations (2)

Title
Martin Simonovsky et al. A Deep Metric for Multimodal Registration. arXiv:1609.05396v1, 2016-09-17. *
Fan Dazhao et al. A deep convolutional neural network method for satellite image matching (卫星影像匹配的深度卷积神经网络方法). Acta Geodaetica et Cartographica Sinica (测绘学报), 2018-06-30, Vol. 47, No. 6. *

Also Published As

Publication number Publication date
CN109934272A (en) 2019-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20220329