CN113190706A - Twin network image retrieval method based on second-order attention mechanism - Google Patents
Twin network image retrieval method based on a second-order attention mechanism
- Publication number
- CN113190706A (application CN202110410902.2A)
- Authority
- CN
- China
- Prior art keywords
- image
- training
- training image
- query image
- order attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a twin network image retrieval method based on a second-order attention mechanism, which comprises the following steps: performing background subtraction on the query image and the training images; adding a second-order attention mechanism after a convolutional layer of a convolutional neural network to obtain a second-order attention convolutional neural network; inputting the processed query image and training images into the second-order attention convolutional neural network for feature extraction to obtain query image features and training image features; applying global average pooling and L2 normalization to the query image features and training image features to obtain a query image descriptor and training image descriptors; measuring the similarity between the query image descriptor and the training image descriptors, and ranking the training image descriptors by similarity to obtain a ranking result; and re-ranking the ranking result to retrieve the training image most similar to the query image. The method improves retrieval precision and saves retrieval time, achieving fast, efficient, and accurate retrieval.
Description
Technical Field
The invention belongs to the technical field of image processing, and relates to a twin network image retrieval method based on a second-order attention mechanism.
Background
In the Internet era, heterogeneous data such as images, videos, audio, and text grow at an alarming rate every day, especially with the popularity of social networking sites such as Flickr and Facebook. For example, Facebook has more than one billion registered users, who upload more than one billion pictures per month; users of the Flickr photo-sharing site uploaded 728 million pictures in 2015, an average of roughly 2 million pictures per day; and the back-end system of Taobao, China's largest e-commerce platform, stores 28.6 billion pictures. Given these massive collections of pictures rich in visual information, how to conveniently, quickly, and accurately query and retrieve the images a user needs or finds interesting from such vast image libraries has become a research hotspot in the field of multimedia information retrieval. Content-based image retrieval exploits the fact that computers excel at repetitive processing tasks, freeing people from manual labeling that consumes enormous human, material, and financial resources. After a decade of development, content-based image retrieval technology has been widely applied in areas of everyday life such as search engines, e-commerce, medicine, and the textile and leather industries.
Image retrieval enables efficient querying and management of image libraries; it refers to retrieving images relevant to a text query or a visual query from a large-scale image database. The approaches in current use are text-based image retrieval (TBIR), content-based image retrieval (CBIR), and semantic-based image retrieval (SBIR). Text-based image retrieval mainly describes the characteristics of an image with text and then retrieves images through text matching. Text-based retrieval techniques are by now well developed and mature, e.g., probabilistic methods, PageRank methods, summarization methods, location methods, classification or part-of-speech tagging methods, and clustering methods (Cheng A, Friedman E. Manipulability of PageRank under Sybil strategies [J]. NetEcon, 2006). Content-based image retrieval queries and analyzes the content of an image, such as its low-level features of shape, texture, and so on: image features are extracted by describing the visual content of the image mathematically, and these mathematical descriptions of low-level features reflect the visual content of the image itself. Semantic-based image retrieval differs from CBIR in that SBIR is an important method and idea for bridging the semantic gap: it considers not only low-level visual features but also high-level information in the image, such as scene, emotion, and spatial relationships. In 2012, Krizhevsky et al. (Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks [C]// Advances in Neural Information Processing Systems, 2012: 1097-1105) demonstrated the effectiveness of deep convolutional neural networks for large-scale image classification. Deep learning algorithms, in particular convolutional neural networks, currently obtain the best retrieval results: they extract the visual features of an image through a combination of pooling layers and convolutional layers and, combined with feedback and classification techniques, achieve better retrieval results.
The problems currently faced are that the precision of image retrieval needs further improvement and that retrieval methods must become more intelligent and diverse. How to retrieve the images users need quickly, efficiently, and accurately is an important topic in the field of image retrieval.
Disclosure of Invention
The invention aims to provide a twin network image retrieval method based on a second-order attention mechanism, which solves the problem of low image retrieval precision in the prior art.
The technical scheme adopted by the invention is a twin network image retrieval method based on a second-order attention mechanism, which comprises the following steps:
step 1, performing background subtraction processing on a query image and a training image;
step 2, adding a second-order attention mechanism after the convolutional layer of the convolutional neural network to obtain a second-order attention convolutional neural network, wherein the second-order attention mechanism processes the output of the convolutional layer to obtain the input of the next layer;
step 3, inputting the query image and the training image processed in step 1 into the second-order attention convolutional neural network respectively for feature extraction to obtain query image features and training image features;
step 4, carrying out global average pooling and L2 normalization on the query image features and the training image features to obtain a query image descriptor D2 and training image descriptors Ds1, where Ds1 denotes the descriptor of the s-th training image, s = 1, …, n;
step 5, measuring the similarity between the query image descriptor and the training image descriptors, and ranking the training image descriptors according to similarity to obtain a ranking result;
and step 6, re-ranking the ranking result and retrieving the training image most similar to the query image.
The invention is also characterized in that:
the convolutional neural network in step 2 comprises 2 × 3 pooling layers, 2 × 2 fully connected layers, and 3 × 1 convolutional layers, and the size of the filter in the convolutional layers is 5 × 5.
The specific process by which the second-order attention mechanism in step 2 converts the output of the convolutional layer into the input of the next layer is as follows:

step a, denote the C feature maps of size H × W as F = [f_1, …, f_C], of overall size H × W × C, and reshape the feature maps f_c into a feature matrix X with C rows and S = W × H columns; the covariance matrix is then calculated by:

$$\Sigma = X \bar{I} X^{\mathsf{T}} \quad (1)$$

where $\bar{I} = \frac{1}{S}\bigl(I - \frac{1}{S}\mathbf{1}\bigr)$, I is the S × S identity matrix, and $\mathbf{1}$ is the S × S all-ones matrix;

step b, perform covariance normalization on the covariance matrix Σ by eigenvalue decomposition:

$$\Sigma = U \Lambda U^{\mathsf{T}} \quad (2)$$

where U is an orthogonal matrix and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_C)$ is the diagonal matrix of eigenvalues;

step c, normalize the covariance matrix Σ processed in step b by converting it into a power of its eigenvalues:

$$\hat{\Sigma} = \Sigma^{\alpha} = U \Lambda^{\alpha} U^{\mathsf{T}} \quad (3)$$

where α is a positive real number and $\Lambda^{\alpha} = \operatorname{diag}(\lambda_1^{\alpha}, \ldots, \lambda_C^{\alpha})$;

step d, let $\hat{\Sigma} = [y_1, \ldots, y_C]$ and shrink $\hat{\Sigma}$ to obtain the channel statistic z, with $z \in \mathbb{R}^{C \times 1}$; the c-th dimension of the channel statistic z is then calculated as:

$$z_c = H_{\mathrm{GCP}}(y_c) = \frac{1}{C} \sum_{i=1}^{C} y_c(i) \quad (4)$$

where $H_{\mathrm{GCP}}(\cdot)$ denotes the global covariance pooling function;

step e, apply a gating mechanism to the statistic $z_c$ of channel c to obtain the scaling factor $w_c$ of channel c:

$$w_c = f\bigl(W_U\, \delta(W_D\, z_c)\bigr) \quad (5)$$

where $W_D$ and $W_U$ are the weights of convolutional layers, and f(·) and δ(·) denote the sigmoid and ReLU functions;

step f, use the scaling factor $w_c$ of channel c to adjust the feature map $f_c$, obtaining the adjusted feature map $\hat{f}_c$ as the input of the next layer:

$$\hat{f}_c = w_c \cdot f_c \quad (6)$$
the specific process of the step 5 is as follows: computing a query image descriptor D2 and each training image descriptor Ds1 Euclidean distance ds(xsY), wherein D2 ═ (y1 … yn), Ds1=(xs1…xsn):
According to Euclidean distance ds(xsAnd y) sequencing the training images from low to high to obtain a sequencing result.
The specific process of step 6 is as follows: select several top-ranked training images from the ranking result, calculate the average of their feature vectors, rearrange the result according to this average vector, and retrieve the training image most similar to the query image.
The invention has the following beneficial effects:
The twin network image retrieval method based on a second-order attention mechanism adds a second-order attention mechanism to the convolution process to strengthen second-order spatial information and reweight the feature maps, so that salient image locations are emphasized before being used for description, which improves both the local and global performance of the image descriptors. The method improves retrieval precision, saves retrieval time, and achieves fast, efficient, and accurate retrieval.
Drawings
FIG. 1 is a flow chart of a twin network image retrieval method based on a second order attention mechanism according to the present invention;
FIG. 2 is a detailed flowchart of a twin network image retrieval method based on a second-order attention mechanism according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
A twin network image retrieval method based on a second-order attention mechanism, as shown in FIG. 1 and FIG. 2, specifically comprises the following steps:
Step 1, select a data set, and apply a background subtraction algorithm to the query image and the training images in the data set;
The data set selected in this embodiment is CIFAR-10, which comprises ten categories, with 50,000 training images and 10,000 test images. CIFAR-10 contains real objects from the real world; not only is the noise large, but the scale and appearance of the objects also vary, which makes recognition considerably difficult. The background subtraction algorithm mainly uses the BackgroundSubtractorMOG2 algorithm from OpenCV in Python. By default the method models the background from the previous 120 frames and applies a probabilistic foreground segmentation algorithm, i.e., Bayesian inference, to decide whether an object belongs to the foreground. The algorithm adaptively assigns newly observed objects a higher weight than older observations, so it can adapt to changes in illumination. Morphological operations such as opening and closing are used to remove unwanted noise.
Step 2, add a second-order attention mechanism after the convolutional layers of the convolutional neural network, which improves the dependency between the features of each convolutional layer, to obtain the second-order attention convolutional neural network. By adding a second-order attention module after each convolutional layer in turn and comparing the experimental results, the convolutional layer best suited for the second-order attention mechanism is found;
Specifically, the convolutional neural network comprises 2 × 3 pooling layers, 2 × 2 fully connected layers, and 3 × 1 convolutional layers containing 32, 32, and 64 filters respectively, and the filters in the convolutional layers are of size 5 × 5. The query image and the training images are convolved to obtain the corresponding feature maps, and a loss function is used to continuously update the weights in the network to achieve the best training effect;
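Since the layer counts in the text above are ambiguous, the following is only a sketch of one plausible reading of this backbone for CIFAR-10 inputs: three 5 × 5 convolutional layers with 32, 32, and 64 filters, each followed by pooling, plus two fully connected layers. All dimensions here are illustrative assumptions:

```python
import torch.nn as nn

# Hypothetical backbone; the exact layer arrangement in the source is unclear.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                        # 32x32 -> 16x16 for CIFAR-10
    nn.Conv2d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                        # 16x16 -> 8x8
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),                        # 8x8 -> 4x4
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 128), nn.ReLU(),  # fully connected layer 1
    nn.Linear(128, 10),                     # fully connected layer 2 (10 classes)
)
```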
the specific process of processing the output of the convolutional layer to obtain the input of the next layer is as follows:
step a, representing the C-dimensional characteristic diagram with the size of H multiplied by W as a characteristic diagram F ═ F1,…,fc]The size is H multiplied by W multiplied by C; mapping the features to fcReshaped to a feature matrix X with dimension C and S ═ W × H, then the covariance matrix is calculated by:
in the above formula, the first and second carbon atoms are,i is an sxs matrix, and 1 is an sxs unit matrix;
step b, as the covariance matrix sigma is symmetrical and semi-positive, the covariance matrix sigma has eigenvalue decomposition (EIG); and carrying out covariance normalization on the covariance matrix sigma to obtain:
∑=U∧UT (2);
in the above formula, U is an orthogonal matrix, Λ ═ diag (λ)1,…,λC) Is a diagonal matrix with eigenvalues;
step c, carrying out convolution normalization on the covariance matrix sigma processed in the step b, and converting the covariance matrix sigma into a power of the characteristic value:
in the above formula, alpha is positive real number and lambadaα=diag(λα 1,…,λα C) (ii) a When α is 1, no normalization is performed; when alpha is<1, it shrinks the characteristic value larger than 1.0 non-linearly, and records the characteristic value smaller than 1.0; according to the data, alpha is 1/2 which has the best effect; in the present embodiment, α — 1/2 is set;
and d, taking the normalized covariance matrix as a channel descriptor through global covariance pooling. In particular, makeBy shrinkingObtaining a channel statistic value z, wherein z belongs to RC×1Then the statistical value z of the channel ccThe calculation method is as follows:
in the above formula, HGCP(. -) represents a global covariance pool function;
step e, applying gate control mechanism to the statistic value z of the channel ccConverting to obtain the scaling factor w in the channel cc:
wc=f(wUδ(WDzc)) (5);
In the above formula, WD、WUSetting the channel dimension of the features as C/r and C respectively for the weight of the convolution layer; f (-) and delta (-) denote sigmoid and RELU functions;
using a scaling factor w in channel ccMapping f to featurescAdjusting to obtain a characteristic diagramI.e. the input to the next layer:
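A minimal PyTorch sketch of steps a to f, assuming a batched eigendecomposition via torch.linalg.eigh and a reduction ratio r = 4 (r is not specified in the text); this illustrates the computation rather than reproducing the patent's exact implementation:

```python
import torch
import torch.nn as nn

class SecondOrderAttention(nn.Module):
    """Second-order (covariance) channel attention, following steps a to f."""
    def __init__(self, channels, r=4, alpha=0.5):
        super().__init__()
        self.alpha = alpha                                # step c: alpha = 1/2
        self.W_D = nn.Conv2d(channels, channels // r, 1)  # step e: reduce to C/r
        self.W_U = nn.Conv2d(channels // r, channels, 1)  # step e: restore to C

    def forward(self, F):                    # F: (B, C, H, W)
        B, C, H, W = F.shape
        S = H * W
        X = F.reshape(B, C, S)               # step a: C x S feature matrix
        eye = torch.eye(S, device=F.device)
        ones = torch.ones(S, S, device=F.device)
        I_bar = (eye - ones / S) / S
        Sigma = X @ I_bar @ X.transpose(1, 2)             # eq. (1): covariance
        lam, U = torch.linalg.eigh(Sigma)                 # eq. (2): EIG
        lam = lam.clamp(min=1e-10).pow(self.alpha)        # eq. (3): eigenvalue power
        Sigma_hat = U @ torch.diag_embed(lam) @ U.transpose(1, 2)
        z = Sigma_hat.mean(dim=2).reshape(B, C, 1, 1)     # eq. (4): GCP statistic
        w = torch.sigmoid(self.W_U(torch.relu(self.W_D(z))))  # eq. (5): gating
        return F * w                                      # eq. (6): rescale maps
```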
Step 3, input the query image and the training images processed in step 1 into the second-order attention convolutional neural network respectively for feature extraction, obtaining the query image features and the training image features;
Step 4, after global average pooling and L2 normalization of the query image features and the training image features, process them with a softmax function to obtain the query image descriptor D2 and the training image descriptors Ds1, where Ds1 denotes the descriptor of the s-th training image, s = 1, …, n;
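A minimal sketch of the descriptor construction, assuming PyTorch tensors; the softmax stage mentioned above belongs to the training-time classification and is omitted here, since retrieval only needs the pooled, normalized descriptors:

```python
import torch.nn.functional as F

def to_descriptor(feature_map):
    """Global average pooling + L2 normalization (step 4).
    feature_map: (B, C, H, W) -> descriptors: (B, C) with unit L2 norm."""
    d = feature_map.mean(dim=(2, 3))    # global average pooling over H, W
    return F.normalize(d, p=2, dim=1)   # L2 normalization
```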
Step 5, measure the similarity between the query image descriptor and each training image descriptor, and rank the training image descriptors according to similarity to obtain a ranking result;
Specifically, the similarity is measured by computing the Euclidean distance $d_s(x_s, y)$ between the query image descriptor D2 and each training image descriptor Ds1, where D2 = (y_1, …, y_n) and Ds1 = (x_{s1}, …, x_{sn}):

$$d_s(x_s, y) = \sqrt{\sum_{i=1}^{n} (x_{si} - y_i)^2}$$

The training images are ranked by Euclidean distance $d_s(x_s, y)$ from low to high (similarity from high to low): the smaller the Euclidean distance, the greater the similarity, i.e., training images ranked nearer the top are more similar to the query image. This yields the ranking result.
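A short NumPy sketch of this ranking step; the array shapes are assumptions:

```python
import numpy as np

def rank_by_distance(query_desc, train_descs):
    """Rank training images by Euclidean distance to the query (step 5).
    query_desc: (n,); train_descs: (m, n). Returns indices, most similar first."""
    d = np.sqrt(((train_descs - query_desc) ** 2).sum(axis=1))
    return np.argsort(d)  # smaller distance = greater similarity
```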
Step 6, rearrange the ranking result and retrieve the training image most similar to the query image.
Specifically, several top-ranked training images are selected from the ranking result, the average of their feature vectors is calculated, the result is rearranged according to this average vector, and the training image most similar to the query image is retrieved.
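This re-ranking amounts to a form of average query expansion. A minimal sketch, where k = 5 is an assumed value (the text only says "several top-ranked images"):

```python
import numpy as np

def rerank(query_desc, train_descs, k=5):
    """Re-rank by the mean vector of the top-k initial results (step 6)."""
    order = np.argsort(np.sqrt(((train_descs - query_desc) ** 2).sum(axis=1)))
    mean_vec = train_descs[order[:k]].mean(axis=0)   # average feature vector
    d = np.sqrt(((train_descs - mean_vec) ** 2).sum(axis=1))
    return np.argsort(d)                             # rearranged ranking
```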
In this way, the twin network image retrieval method based on a second-order attention mechanism adds a second-order attention mechanism to the convolution process to strengthen second-order spatial information and reweight the feature maps, so that salient image locations are emphasized before being used for description, improving both the local and global performance of the image descriptors. The method improves retrieval precision, saves retrieval time, and achieves fast, efficient, and accurate retrieval.
Claims (5)
1. A twin network image retrieval method based on a second-order attention mechanism is characterized by comprising the following steps:
step 1, performing background subtraction processing on a query image and a training image;
step 2, adding a second-order attention mechanism after the convolutional layer of the convolutional neural network to obtain a second-order attention convolutional neural network, wherein the second-order attention mechanism processes the output of the convolutional layer to obtain the input of the next layer;
step 3, inputting the query image and the training image processed in step 1 into the second-order attention convolutional neural network respectively for feature extraction to obtain query image features and training image features;
step 4, carrying out global average pooling and L2 normalization on the query image features and the training image features to obtain a query image descriptor D2 and training image descriptors Ds1, where Ds1 denotes the descriptor of the s-th training image, s = 1, …, n;
step 5, measuring the similarity between the query image descriptor and the training image descriptors, and ranking the training image descriptors according to the similarity to obtain a ranking result;
and step 6, rearranging the ranking result and retrieving the training image most similar to the query image.
2. The twin network image retrieval method based on the second-order attention mechanism according to claim 1, wherein the convolutional neural network in step 2 comprises 2 × 3 pooling layers, 2 × 2 fully connected layers, and 3 × 1 convolutional layers, and the filters in the convolutional layers are of size 5 × 5.
3. The twin network image retrieval method based on the second-order attention mechanism according to claim 1, wherein the specific process of processing the output of the convolutional layer in step 2 to obtain the input of the next layer is as follows:

step a, denoting the C feature maps of size H × W as F = [f_1, …, f_C], of overall size H × W × C, and reshaping the feature maps f_c into a feature matrix X with C rows and S = W × H columns; the covariance matrix is then calculated by:

$$\Sigma = X \bar{I} X^{\mathsf{T}} \quad (1)$$

where $\bar{I} = \frac{1}{S}\bigl(I - \frac{1}{S}\mathbf{1}\bigr)$, I is the S × S identity matrix, and $\mathbf{1}$ is the S × S all-ones matrix;

step b, performing covariance normalization on the covariance matrix Σ by eigenvalue decomposition:

$$\Sigma = U \Lambda U^{\mathsf{T}} \quad (2)$$

where U is an orthogonal matrix and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_C)$ is the diagonal matrix of eigenvalues;

step c, normalizing the covariance matrix Σ processed in step b by converting it into a power of its eigenvalues:

$$\hat{\Sigma} = \Sigma^{\alpha} = U \Lambda^{\alpha} U^{\mathsf{T}} \quad (3)$$

where α is a positive real number and $\Lambda^{\alpha} = \operatorname{diag}(\lambda_1^{\alpha}, \ldots, \lambda_C^{\alpha})$;

step d, letting $\hat{\Sigma} = [y_1, \ldots, y_C]$ and shrinking $\hat{\Sigma}$ to obtain the channel statistic z, with $z \in \mathbb{R}^{C \times 1}$; the c-th dimension of the channel statistic z is then calculated as:

$$z_c = H_{\mathrm{GCP}}(y_c) = \frac{1}{C} \sum_{i=1}^{C} y_c(i) \quad (4)$$

where $H_{\mathrm{GCP}}(\cdot)$ denotes the global covariance pooling function;

step e, applying a gating mechanism to the statistic $z_c$ of channel c to obtain the scaling factor $w_c$ of channel c:

$$w_c = f\bigl(W_U\, \delta(W_D\, z_c)\bigr) \quad (5)$$

where $W_D$ and $W_U$ are the weights of convolutional layers, and f(·) and δ(·) denote the sigmoid and ReLU functions;

step f, using the scaling factor $w_c$ of channel c to adjust the feature map $f_c$, obtaining the adjusted feature map $\hat{f}_c$ as the input of the next layer:

$$\hat{f}_c = w_c \cdot f_c \quad (6)$$
4. The twin network image retrieval method based on the second-order attention mechanism according to claim 1, wherein the specific process of step 5 is as follows: computing the Euclidean distance $d_s(x_s, y)$ between the query image descriptor D2 and each training image descriptor Ds1, where D2 = (y_1, …, y_n) and Ds1 = (x_{s1}, …, x_{sn}):

$$d_s(x_s, y) = \sqrt{\sum_{i=1}^{n} (x_{si} - y_i)^2}$$

and ranking the training images by the Euclidean distance $d_s(x_s, y)$ from low to high to obtain the ranking result.
5. The twin network image retrieval method based on the second-order attention mechanism according to claim 1, wherein the specific process of step 6 is as follows: selecting several top-ranked training images from the ranking result, calculating the average of their feature vectors, rearranging the result according to this average vector, and retrieving the training image most similar to the query image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110410902.2A CN113190706A (en) | 2021-04-16 | 2021-04-16 | Twin network image retrieval method based on second-order attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113190706A true CN113190706A (en) | 2021-07-30 |
Family
ID=76977188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110410902.2A Pending CN113190706A (en) | 2021-04-16 | 2021-04-16 | Twin network image retrieval method based on second-order attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113190706A (en) |
- 2021-04-16 CN CN202110410902.2A patent/CN113190706A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120070075A1 (en) * | 2010-09-17 | 2012-03-22 | Honeywell International Inc. | Image processing based on visual attention and reduced search based generated regions of interest |
CN111198964A (en) * | 2020-01-10 | 2020-05-26 | 中国科学院自动化研究所 | Image retrieval method and system |
CN111354017A (en) * | 2020-03-04 | 2020-06-30 | 江南大学 | Target tracking method based on twin neural network and parallel attention module |
Non-Patent Citations (2)

Title |
---|
Tao Dai et al., "Second-order Attention Network for Single Image Super-Resolution", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) * |
Wu Yuwei, "深度学习基础与应用" (Fundamentals and Applications of Deep Learning), Beijing Institute of Technology Press, 30 April 2020 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113920587A (en) * | 2021-11-01 | 2022-01-11 | 哈尔滨理工大学 | Human body posture estimation method based on convolutional neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Annotating images by mining image search results | |
Wang et al. | Annosearch: Image auto-annotation by search | |
CN106033426B (en) | Image retrieval method based on latent semantic minimum hash | |
Wang et al. | Retrieval-based face annotation by weak label regularized local coordinate coding | |
Sun et al. | Scene image classification method based on Alex-Net model | |
Mishra et al. | Image mining in the context of content based image retrieval: a perspective | |
CN111723692B (en) | Near-repetitive video detection method based on label features of convolutional neural network semantic classification | |
CN114461839A (en) | Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment | |
CN110765285A (en) | Multimedia information content control method and system based on visual characteristics | |
Nezamabadi-pour et al. | Concept learning by fuzzy k-NN classification and relevance feedback for efficient image retrieval | |
Yao | Key frame extraction method of music and dance video based on multicore learning feature fusion | |
Kumar et al. | Content based video retrieval using deep learning feature extraction by modified VGG_16 | |
CN113190706A (en) | Twin network image retrieval method based on second-order attention mechanism | |
Liu et al. | Bit reduction for locality-sensitive hashing | |
Min et al. | Overview of content-based image retrieval with high-level semantics | |
Zhang et al. | Improved image retrieval algorithm of GoogLeNet neural network | |
Barz et al. | Content-based image retrieval and the semantic gap in the deep learning era | |
Morsillo et al. | Mining the web for visual concepts | |
Wu et al. | Deep Hybrid Neural Network With Attention Mechanism for Video Hash Retrieval Method | |
Ouni | A machine learning approach for image retrieval tasks | |
Zare Chahooki et al. | Bridging the semantic gap for automatic image annotation by learning the manifold space. | |
Ding et al. | A bag-of-feature model for video semantic annotation | |
Huo et al. | Fused feature encoding in convolutional neural network | |
Maihami et al. | Color Features and Color Spaces Applications to the Automatic Image Annotation | |
He et al. | Construction of user preference profile in a personalized image retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210730 |
RJ01 | Rejection of invention patent application after publication |