CN116258938A - Image retrieval and identification method based on autonomous evolution loss

Info

Publication number: CN116258938A
Application number: CN202211577810.4A
Authority: CN
Inventors: 王鹏; ***; 张艳宁; 吴瑞祺; 杨路
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2022-12-09
Filing date: 2022-12-09
Publication date: 2023-06-13

Abstract

The invention discloses an image retrieval and identification method based on autonomous evolution loss. It provides a new Softmax-like metric loss function, the gradient stop Softmax loss function, which shares parameters with the original Softmax loss function and differs from it in three main ways: a different γ, L2-normalized features, and stopped gradient updates through W_j. Because the features used in the gradient stop Softmax loss function are L2 normalized, the distance metric of the training stage is consistent with that of the testing stage. Because its parameters are shared with the original Softmax loss function, the network can still obtain good class representations (class centers), which solves the problem that training is difficult to converge. In the gradient stop Softmax loss function, gradient updates are stopped for the class centers but not for the sample features, and this setting forces the sample features toward their class centers on the hypersphere. The method thereby better addresses the degraded model learning that results when deep metric learning for image retrieval is trained with the Softmax loss function alone.

Description

Image retrieval and identification method based on autonomous evolution loss

Technical Field

The invention belongs to the field of image retrieval, and particularly relates to an image retrieval and identification method based on autonomous evolution loss.

Background

The basic form of an image retrieval task is to take a query image containing a specific instance (e.g., a specific object, scene, or building) and find the images in a database that contain the same instance. Deep metric learning is one of the important methods widely used in image retrieval tasks.

In an autonomously evolving metric learning task, deep metric learning (DML) aims to learn a similarity metric by mapping samples into a high-dimensional space in which samples of the same instance lie close together and samples of different instances lie far apart. Typical applications of deep metric learning include image retrieval and person re-identification. Popular deep metric learning methods fall into pairwise-based methods and Softmax-based methods. Pairwise-based methods focus on improving the sample weighting strategy of existing pairwise losses (such as the contrastive loss and the triplet loss). They act directly on the distance between pairs of points in the embedding space, which is closely related to the objective of DML. As for Softmax-based methods, some existing work holds that good performance can also be achieved by training the model with the Softmax loss. In contrast to pairwise-based methods, Softmax-based methods can be viewed as approximating each class with a proxy and using all proxies to provide a global context in each training iteration.

Prior studies found that optimizing Softmax-based methods corresponds to an approximate bound optimization of the basic pairwise losses, indicating that minimizing the Softmax loss is equivalent to maximizing a discriminative view of the mutual information between features and labels. In practice, when training a Softmax-based deep metric learning model, the inner product without L2 normalization (i.e., the last fully connected layer) is the most widely used similarity measure, but features are typically L2 normalized during the test phase, so the distance metric used in training differs from the one used in testing. A simple way to close this gap is to apply L2 normalization directly during training. However, introducing L2 normalization makes it difficult for the network to converge, resulting in training failure.

Prior studies suggest that this is mainly because the L2-normalized inner product only ranges over [-1, 1], which prevents the predicted probability from approaching 1 even when the samples are well separated. To address this convergence problem, researchers have added a scaling layer after the inner product. The scaling layer has a learnable parameter that scales the inner product output to magnitudes larger than 1, which allows the Softmax loss to keep decreasing and thereby helps the network converge. However, this approach does not guarantee that the network learns the optimal scaling parameter.
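The bound described above can be checked numerically. The following is a minimal sketch, not part of the patent: the function name is hypothetical, and the scale of 30 used for comparison is only an illustrative value. It computes the largest Softmax probability the true class can reach when every logit is a cosine similarity in [-1, 1], with and without a scale factor.

```python
import math

def max_true_class_probability(num_classes: int, scale: float = 1.0) -> float:
    """Largest Softmax probability of the true class when every logit is a
    cosine similarity in [-1, 1] multiplied by `scale` (hypothetical helper)."""
    best = math.exp(scale * 1.0)                         # true class at cosine +1
    rest = (num_classes - 1) * math.exp(scale * -1.0)    # all other classes at cosine -1
    return best / (best + rest)

# Compare the unscaled case with an illustrative scale of 30.
for c in (10, 100, 1000):
    unscaled = max_true_class_probability(c)
    scaled = max_true_class_probability(c, scale=30.0)
    print(f"classes={c}: unscaled<={unscaled:.4f}, scaled<={scaled:.4f}")
```

Even in the best case the unscaled probability stays well below 1 once there are many classes, so the Softmax loss cannot approach zero; scaling removes that ceiling.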

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide an image retrieval and identification method based on autonomous evolution loss, which addresses the degraded model learning that results when deep metric learning is applied to an image retrieval task and only the Softmax loss function is learned.

The invention is realized by the following technical scheme:

the image retrieval and identification method based on the autonomous evolution loss is characterized by comprising the following steps:

(1) Using a ResNet50 model as the backbone network, pre-trained on the ImageNet large-scale classification dataset;

(2) Replacing global mean pooling with generalized mean pooling, adding a batch normalization layer without scaling and bias terms on top of the backbone network, removing the last BN-ReLU module in the backbone network, and computing the recall at test time using the L2-normalized Euclidean distance;

(3) Resizing all input images to 256 x 256 resolution and cropping them to 224 x 224 resolution; at the model application stage, no data augmentation is applied to the input data, which is only resized to 256 x 256;

(4) Training the model for 100 epochs, setting the learning rate with a cosine annealing schedule, and using γ = 30 as the default value;

(5) Applying label smoothing to the Softmax loss function and, for training stability, adding the gradient stop Softmax loss function to training only when the value of the Softmax loss function is less than 3;

the gradient stop Softmax loss function is given by equation (2):

[Equation (2): the formula is not reproduced in this text.]

where N is the number of samples in each input batch, c is the number of classes in the training set, f_i is the feature of the i-th sample, y_i is the label of the i-th sample, and W_j is the j-th column of the last fully connected layer, corresponding to the j-th class. The normalization symbol in equation (2) represents L2 normalization, the stop-gradient symbol indicates that gradient updates through W_j are not allowed, and γ is a predefined scalar; Softplus(x) = log(1 + e^x);

(6) Freezing the model parameters, performing no network parameter updates via stochastic gradient descent, and using the network only as an image feature extractor;

(7) Taking the output feature F from the ResNet-50 feature extraction network to which the gradient stop Softmax loss function and the BN-ReLU removal have been applied;

(8) For a query sample, obtaining its feature F_q by model inference, and extracting and storing the features of all images in the image library as a feature sequence {F_1, …, F_m};

(9) Computing the Euclidean distance between the query feature F_q and every image feature in the image library: d_i = ||F_q - F_i||_2, i = 1, 2, 3, …, m;

(10) Obtaining the distance sequence D = [d_1, d_2, …, d_m];

(11) Sorting D by distance and taking the L images closest to the query sample; if any of these images has the same ID as the query sample, the image retrieval is considered successful.

Further, in step (5), the Softmax loss function and the gradient stop Softmax loss function are used in combination, and the total loss is expressed as follows:

L = L_softmax + L_SGSL (3)

the gradient stop Softmax loss function and the original Softmax loss function share parameters.

Further, the image library employed was constructed as follows:

CUB-200-2011 has 200 classes and 11,788 images; the first 100 classes (5,864 images) are used for training, and the remaining classes (5,924 images) are used for testing;

CARS196 has 196 classes and 16,185 images; the first 98 classes (8,054 images) are used for training, and the other 98 classes (8,131 images) are used for testing;

Stanford Online Products has 22,634 classes and 120,053 images; the first 11,318 classes (59,551 images) are used for training, and the other 11,316 classes (60,502 images) are used for testing;

In-shop imaging has 50 fine-grained categories and 1,000 attributes and contains more than 800,000 images.

The invention provides an image retrieval and identification method based on autonomous evolution loss, together with a new Softmax-like metric loss function that shares parameters with the original Softmax loss function and differs from it in three main ways: a different γ, L2-normalized features, and stopped gradient updates through W_j. Because the features used in the gradient stop Softmax loss function are L2 normalized, the distance metric of the training phase is consistent with that of the test phase, and sharing parameters with the original Softmax loss function also enables the network to obtain good class representations (class centers), thereby solving the problem that training is difficult to converge.

At the same time, in the gradient stop Softmax loss function, gradient updates are stopped for the class centers but not for the sample features, and this setting forces the sample features toward their class centers on the hypersphere. The method thereby better addresses the degraded model learning that results when deep metric learning for image retrieval is trained with the Softmax loss function alone.

Drawings

FIG. 1 is a flowchart of an overall image retrieval method;

FIG. 2 is a schematic diagram of an original Softmax loss function and a gradient stop Softmax loss function according to the present invention;

Detailed Description

The invention will now be described in further detail with reference to specific examples, which are intended to illustrate, but not to limit, the invention.

As shown in fig. 1, the invention provides an image retrieval and identification method based on autonomous evolution loss, which comprises the following specific implementation processes:

(1) Using ResNet50 as the backbone network and pre-training on the ImageNet large classification dataset;

(2) A generalized mean pooling is used for replacing a global mean pooling, a batch normalization layer without a scaling item and a bias item is added on a backbone network, and an L2 normalization Euclidean distance is used for calculating recall rate during testing;

(3) All input images are resized to 256 x 256 resolution and cropped to 224 x 224 resolution, with a batch size of 64 images (4 images per ID, 16 IDs in total). At the model application stage, no data augmentation is applied to the input data, which is only resized to 256 x 256;

(4) The model is trained for 100 epochs, the learning rate is set with a cosine annealing schedule, and γ = 30 is used as the default value;

(6) The model parameters are frozen, no network parameter updates are performed via stochastic gradient descent, and the network is used only as an image feature extractor;

(7) In the actual inference process, the output feature F is taken from the ResNet-50 feature extraction network to which the gradient stop Softmax loss function and the BN-ReLU removal technique have been applied;

(8) For the query sample, its feature F_q is obtained by model inference; the features of all images in the image library are extracted and stored as a feature sequence {F_1, …, F_m};

(9) The Euclidean distance between F_q and every image feature in the image library is calculated: d_i = ||F_q - F_i||_2, i = 1, 2, 3, …, m;

(10) The distance sequence D = [d_1, d_2, …, d_m] is obtained;

(11) D is sorted by distance and the L images closest to the query sample are taken; if any of these images has the same ID as the query sample, the image retrieval is considered successful.
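The batch composition in step (3), 16 IDs with 4 images each, can be sketched as follows. This is an illustrative sampler only: the function name, the use of a seeded random generator, and the choice to skip IDs with fewer than 4 images are assumptions that the patent does not specify.

```python
import random
from collections import defaultdict

def sample_id_balanced_batch(labels, ids_per_batch=16, images_per_id=4, rng=None):
    """Return dataset indices for one batch of ids_per_batch * images_per_id images.

    `labels[i]` is the ID of image i. IDs with too few images are skipped,
    which is an assumption made for this sketch.
    """
    rng = rng or random.Random()
    indices_by_id = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_id[label].append(index)
    eligible = [label for label, idxs in indices_by_id.items()
                if len(idxs) >= images_per_id]
    batch = []
    for label in rng.sample(eligible, ids_per_batch):
        batch.extend(rng.sample(indices_by_id[label], images_per_id))
    return batch

# Toy data: 20 IDs with 6 images each.
labels = [i // 6 for i in range(120)]
batch = sample_id_balanced_batch(labels, rng=random.Random(0))
print(len(batch), "indices drawn from", len({labels[i] for i in batch}), "IDs")
```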

As shown in FIG. 2, the scheme innovation mainly comprises a gradient stop Softmax loss function and a BN-ReLU removal technology.

1. Gradient stop Softmax loss function

Gradient stop Softmax loss function (Stop-Gradient Softmax Loss, SGSL): many existing methods delete the bias term in the last fully connected layer, and this setup is followed here when training the classification network for metric learning. For a better understanding of the method of the present invention, the original Softmax and its variants are briefly reviewed here. The original Softmax loss function without the bias term is shown in equation (1) below:

[Equation (1): the formula is not reproduced in this text.]

where N is the number of samples in each input batch, c is the number of classes in the training set, f_i is the feature of the i-th sample, and y_i is the label of the i-th sample. W_j is the j-th column of the last fully connected layer, corresponding to the j-th class. In addition, Softplus(x) = log(1 + e^x), together with a second auxiliary function whose formula is not reproduced in this text.
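The formula image for equation (1) is missing from this text. One reconstruction consistent with the definitions above (bias-free Softmax, Softplus, and a log-sum-exp over the non-target classes) is sketched below. The name LSE and the layout are assumptions, not the patent's own rendering; the equality between the two forms is a standard identity.

```latex
L_{\mathrm{softmax}}
  = -\frac{1}{N}\sum_{i=1}^{N}
      \log\frac{e^{W_{y_i}^{\top} f_i}}{\sum_{j=1}^{c} e^{W_j^{\top} f_i}}
  = \frac{1}{N}\sum_{i=1}^{N}
      \operatorname{Softplus}\!\Big(
        \operatorname{LSE}_{j\neq y_i}\big(W_j^{\top} f_i\big) - W_{y_i}^{\top} f_i
      \Big),
\qquad
\operatorname{LSE}_{j}(x_j) = \log\sum_{j} e^{x_j}.
```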

the gradient stop Softmax loss function formula proposed by the invention is similar to the standard Softmax loss function, but has three differences:

1) γ in the gradient stop Softmax loss function is not fixed at 1 but takes a larger value. γ can be regarded as a scale parameter that controls the loss, but unlike the traditional scale parameter, γ is applied only when W_j and f_i belong to different classes;

2) W_j and f_i are both L2 normalized;

3) gradient updates through W_j are not allowed.

Therefore, the gradient stop Softmax loss function proposed by the present invention is defined as equation (2):

[Equation (2): the formula is not reproduced in this text.]

where the normalization symbol represents L2 normalization, the stop-gradient symbol indicates that gradient updates through W_j are not allowed, and γ is a predefined scalar. The other symbols have the same meaning as in equation (1). In the experiments, the original Softmax loss function and the gradient stop Softmax loss function proposed by the present invention are combined, and the total loss is expressed as:

L = L_softmax + L_SGSL (3)
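The formula image for equation (2) is missing from this text. The sketch below is one form consistent with the three stated differences (γ applied only to the terms where W_j and f_i belong to different classes, L2 normalization of both W_j and f_i, and no gradient through W_j) and with the later max-based approximation. The hat and sg(·) notation, and in particular the 1/γ factor in front of the log-sum-exp, are assumptions and should not be read as the patent's exact formula.

```latex
L_{\mathrm{SGSL}}
  = \frac{1}{N}\sum_{i=1}^{N}
      \operatorname{Softplus}\!\Big(
        \tfrac{1}{\gamma}\,
        \operatorname{LSE}_{j\neq y_i}\big(\gamma\,\operatorname{sg}(\hat{W}_j)^{\top}\hat{f}_i\big)
        - \operatorname{sg}(\hat{W}_{y_i})^{\top}\hat{f}_i
      \Big),
\qquad
\hat{x} = \frac{x}{\lVert x\rVert_2},
```

where sg(·) passes its argument through unchanged in the forward computation and blocks the gradient in the backward computation.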

FIG. 2 illustrates the original Softmax loss function and the gradient stop Softmax loss function proposed by the present invention, which share parameters. The working principle of the gradient stop Softmax loss function is analyzed as follows. Softplus(x) = log(1 + e^x) is a convex, monotonically increasing function that can be regarded as a smooth version of the positive part function max(0, x). The so-called log-sum-exp function (its formula is not reproduced in this text) is a functional form frequently encountered in dynamic discrete choice models and can be viewed as a smooth version of the function that selects the largest value in a set of data. The larger γ is, the smaller the approximation error, so the present invention sets γ = 30 as the default value in equation (2).
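The two smoothing claims above can be checked numerically. The sketch below assumes the scaled form LSE(γ·x)/γ as the smooth maximum, which is an assumption about how γ enters; all function names are hypothetical. For n values, the scaled log-sum-exp exceeds the true maximum by at most log(n)/γ, so the gap shrinks as γ grows.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def smooth_max(values, gamma):
    """Scaled log-sum-exp LSE(gamma * v) / gamma, an upper bound on max(values)."""
    return log_sum_exp([gamma * v for v in values]) / gamma

def softplus(x):
    """Numerically stable log(1 + exp(x)), a smooth version of max(0, x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

similarities = [0.90, 0.85, -0.30, 0.10]   # toy cosine similarities
for gamma in (1.0, 10.0, 30.0):
    gap = smooth_max(similarities, gamma) - max(similarities)
    bound = math.log(len(similarities)) / gamma
    print(f"gamma={gamma}: gap={gap:.4f}, bound={bound:.4f}")
```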

Based on the above analysis, equation (2) can be expressed approximately as equation (4):

[Equation (4): the formula is not reproduced in this text.]

where [·]_+ represents max(·, 0) and the stop-gradient symbol indicates that gradient updates through W_j are not allowed. Cosine similarity, the normalized version of the inner product of two vectors, measures the similarity between features independently of their magnitude and is equivalent to the L2-normalized Euclidean distance.
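The formula image for equation (4) is also missing. A reconstruction that matches the description (Softplus replaced by the positive part, the smooth maximum replaced by a hard maximum, and cosine similarity as the measure) is sketched below; the sg(·) notation and the exact layout are assumptions. The last identity is the standard relation behind the stated equivalence between cosine similarity and the L2-normalized Euclidean distance.

```latex
L_{\mathrm{SGSL}} \approx \frac{1}{N}\sum_{i=1}^{N}
  \Big[\max_{j\neq y_i}\cos\big(\operatorname{sg}(W_j),\, f_i\big)
       - \cos\big(\operatorname{sg}(W_{y_i}),\, f_i\big)\Big]_{+},
\qquad
\cos(a, b) = \frac{a^{\top} b}{\lVert a\rVert_2\,\lVert b\rVert_2},
\qquad
\lVert \hat{a} - \hat{b} \rVert_2^{2} = 2 - 2\cos(a, b).
```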

From equation (4), it can be seen that the optimization objective of the gradient stop Softmax loss function is to make the cosine similarity between f_i and the weight of its own class, W_{y_i}, greater than the maximum cosine similarity between f_i and the weights W_j of the other classes (j ≠ y_i). In other words, the gradient stop Softmax loss function requires that the features learned by the network be closer to the representation of their own class (with closeness measured by the L2-normalized Euclidean distance) and farther from the representations of the other classes, which is closely related to the goal of deep metric learning. In addition, because the gradient stop Softmax loss function does not allow gradient updates through W, applying it in actual model training causes fewer convergence difficulties.
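The objective described above can be illustrated with a forward-only computation in plain Python. This is a sketch under stated assumptions, not the patent's implementation: it assumes the loss takes the form Softplus(LSE(γ·cos_neg)/γ - cos_pos), and all names are hypothetical. Plain Python has no automatic differentiation, so the stop-gradient is not modelled; it only affects the backward pass (in PyTorch, for example, the class weights would be passed through `Tensor.detach()`).

```python
import math

def cosine(a, b):
    """Cosine similarity, i.e. the inner product of the L2-normalized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sgsl_forward(features, labels, class_weights, gamma=30.0):
    """Forward value of the assumed SGSL form, averaged over the batch."""
    total = 0.0
    for f, y in zip(features, labels):
        sims = [cosine(w, f) for w in class_weights]
        scaled_neg = [gamma * s for j, s in enumerate(sims) if j != y]
        m = max(scaled_neg)
        smooth_max_neg = (m + math.log(sum(math.exp(v - m) for v in scaled_neg))) / gamma
        total += softplus(smooth_max_neg - sims[y])
    return total / len(features)

def hinge_forward(features, labels, class_weights):
    """Max-based approximation: [max_neg_cos - pos_cos]_+ averaged over the batch."""
    total = 0.0
    for f, y in zip(features, labels):
        sims = [cosine(w, f) for w in class_weights]
        max_neg = max(s for j, s in enumerate(sims) if j != y)
        total += max(max_neg - sims[y], 0.0)
    return total / len(features)

class_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # toy class centers
features = [[2.0, 0.1], [0.1, 3.0]]                      # toy sample features
print("correct labels:", sgsl_forward(features, [0, 1], class_weights))
print("wrong labels:  ", sgsl_forward(features, [2, 0], class_weights))
```

With correct labels each feature is nearest to its own class center and the loss is small; with wrong labels the nearest center belongs to another class and the loss grows.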

2. BN-ReLU removal technique

In deep metric learning, the feature extraction network ResNet50 without its last fully connected layer is often used as the backbone network. Many methods add a batch normalization (BN) layer without scaling and bias terms on top of the backbone network, because it smooths and normalizes the feature distribution and enhances intra-class compactness. However, doing so gives the last three layers of the network the form BN-ReLU-BN, which increases the learning burden of the model. The additional consecutive BN-ReLU module brings no new information to the output features and may instead lose some information useful for metric learning. Therefore, the present invention applies the BN-ReLU removal technique and removes the last BN-ReLU module in the backbone network.

1. Data set selection

For the data sets, the method uses general image retrieval data sets to evaluate its recognition capability on fine-grained image retrieval tasks and to analyze the improvement in model performance brought by the gradient stop Softmax loss function. Specifically, the data sets used in the present invention are as follows:

1) CUB-200-2011 has 200 classes and 11,788 images. The first 100 classes (5,864 images) are used for training and the remaining classes (5,924 images) are used for testing.

2) CARS196 has 196 classes and 16,185 images. The first 98 classes (8,054 images) are used for training and the other 98 classes (8,131 images) are used for testing.

3) Stanford Online Products has 22,634 classes and 120,053 images. The first 11,318 classes (59,551 images) are used for training and the other 11,316 classes (60,502 images) are used for testing.

4) In-shop imaging is a large clothing data set with comprehensive annotations. It has 50 fine-grained categories and 1,000 attributes and contains more than 800,000 images, which are annotated with a large number of attributes, clothing landmarks, and correspondences between images taken in different scenes, including shops, street snapshots, and consumer photos.

2. Implementation detail setting

The experiments of the present invention were completed on NVIDIA RTX 2080Ti graphics processors using the PyTorch deep learning framework. The number of graphics processors used for training depends on the size of the data set: 4 graphics processors for the Stanford Online Products data set and 2 graphics processors for the other data sets. All experiments used ResNet50 as the backbone network, pre-trained on the ImageNet large-scale classification data set, with generalized mean pooling in place of global mean pooling. As in most current deep metric learning methods, a batch normalization layer without scaling and bias terms is added on top of the backbone network, and the recall at test time is calculated using the L2-normalized Euclidean distance.
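Generalized mean pooling raises each activation to a power p, averages over the spatial positions, and takes the p-th root, so that p = 1 recovers global mean pooling and large p approaches max pooling. The sketch below is illustrative only: the patent does not state the exponent, so p = 3 is an assumed value, and the function name and the small clamp `eps` are hypothetical.

```python
def generalized_mean_pool(feature_map, p=3.0, eps=1e-6):
    """Pool each channel of `feature_map` to one value with the generalized mean.

    `feature_map` is a list of channels, each a flat list of spatial activations.
    Activations are clamped to at least `eps` so the power is well defined.
    """
    pooled = []
    for channel in feature_map:
        clamped = [max(v, eps) for v in channel]
        mean_of_powers = sum(v ** p for v in clamped) / len(clamped)
        pooled.append(mean_of_powers ** (1.0 / p))
    return pooled

channel = [[1.0, 2.0, 3.0, 4.0]]                  # one toy channel
print(generalized_mean_pool(channel, p=1.0))      # equals the plain mean
print(generalized_mean_pool(channel, p=3.0))      # between mean and max
print(generalized_mean_pool(channel, p=50.0))     # close to the max
```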

For the parameter settings, all input images are resized to 256×256 resolution and cropped to 224×224 resolution, and each batch contains 64 images (4 images per ID, 16 IDs in total). The model is trained for 100 epochs, and the learning rate is set with a cosine annealing schedule. γ = 30 is used as the default value. To build a robust model with good generalization capability, label smoothing is applied to the Softmax loss, and for training stability, the gradient stop Softmax loss function joins the training only when the value of the Softmax loss is less than 3.
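The schedule and the loss gating described above can be sketched as follows. The base learning rate and minimum learning rate are not given in the patent, so the values here are placeholders, and the function names are hypothetical. The sketch also assumes the threshold test is applied at every training step, which is one possible reading of the text.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=100, base_lr=0.01, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 to min_lr at total_epochs.

    base_lr and min_lr are placeholder values, not taken from the patent.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

def total_loss(softmax_loss, sgsl_loss, threshold=3.0):
    """Add the gradient stop Softmax loss only once the Softmax loss is below the threshold."""
    if softmax_loss < threshold:
        return softmax_loss + sgsl_loss
    return softmax_loss

for epoch in (0, 25, 50, 75, 100):
    print(epoch, cosine_annealing_lr(epoch))
print(total_loss(3.5, 0.5), total_loss(2.0, 0.5))
```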

3. Model application

At this stage, the present invention applies no data augmentation to the input data and only resizes it to an image size of 256×256. The model parameters are frozen, no network parameter updates are performed via stochastic gradient descent, and the network is used only as an image feature extractor. In the actual inference process, the present invention uses the output feature F of the ResNet-50 feature extraction network to which the gradient stop Softmax loss function and the BN-ReLU removal technique have been applied. For the query sample, its feature F_q is obtained by model inference. The features of all images in the image library are extracted and stored as a feature sequence {F_1, …, F_m}, and then the Euclidean distance between F_q and every image feature in the image library is calculated:

d_i = ||F_q - F_i||_2, i = 1, 2, 3, …, m

The distance sequence D = [d_1, d_2, …, d_m] is then obtained and sorted by distance, and the L images closest to the query sample are taken. If any of these images has the same ID as the query sample, the image retrieval is considered successful.
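The retrieval procedure above can be sketched in a few lines. The toy features, IDs, and function names are illustrative assumptions; features are L2 normalized before the Euclidean distance is computed, in line with the test-time metric described earlier.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_feature, gallery_features, top_l):
    """Return the indices of the top_l nearest gallery images and all distances."""
    q = l2_normalize(query_feature)
    distances = [euclidean(q, l2_normalize(g)) for g in gallery_features]
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    return ranked[:top_l], distances

def retrieval_succeeded(query_id, gallery_ids, ranked):
    """Retrieval succeeds if any returned image shares the query ID."""
    return any(gallery_ids[i] == query_id for i in ranked)

gallery = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]   # toy features F_1..F_m
gallery_ids = ["a", "b", "a", "c"]
ranked, distances = retrieve([1.0, 0.05], gallery, top_l=2)
print(ranked, retrieval_succeeded("a", gallery_ids, ranked))
```

Averaging the success indicator over all queries gives the recall at L reported in the evaluation.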

In terms of recall, the method not only solves the problem that the model is difficult to converge but also reaches 75.9% on CUB-200-2011, 94.7% on CARS196, and 83.1% on SOP, exceeding conventional methods by at least 1.7%, 2.9%, and 1.7%, respectively.

The foregoing is illustrative of the present invention and is not to be construed as limiting it; various modifications, equivalent arrangements, and improvements made within the spirit and principles of the present invention fall within its scope.

Claims

1. The image retrieval and identification method based on the autonomous evolution loss is characterized by comprising the following steps:

the gradient stop Softmax loss function is given by equation (2):

[Equation (2): the formula is not reproduced in this text.]

where the normalization symbol represents L2 normalization, the stop-gradient symbol indicates that gradient updates through W_j are not allowed, and γ is a predefined scalar; Softplus(x) = log(1 + e^x);

(10) Obtaining the distance sequence D = [d_1, d_2, …, d_m];

2. The image retrieval and identification method based on autonomous evolution loss according to claim 1, wherein in step (5) the Softmax loss function and the gradient stop Softmax loss function are combined, and the total loss is expressed as follows:

L = L_softmax + L_SGSL (3)

3. The image retrieval and identification method based on autonomous evolution loss according to claim 1, wherein the image library used is constructed as follows: