CN113704537A - Fine-grained cross-media retrieval method based on multi-scale feature union - Google Patents
Fine-grained cross-media retrieval method based on multi-scale feature union
- Publication number
- CN113704537A (application CN202111258804.8A)
- Authority
- CN
- China
- Prior art keywords
- sample
- class
- fine-grained
- level
- Prior art date
- 2021-10-28
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/583 — Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
- G06F16/5866 — Information retrieval of still image data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments
- G06F16/783 — Information retrieval of video data; retrieval characterised by using metadata automatically derived from the content
- G06F16/7867 — Information retrieval of video data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, title and artist information
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a fine-grained cross-media retrieval method based on multi-scale feature union. On the basis of conventional sample-level features, the method additionally introduces target-level features and pixel-level features of key regions, and constructs three class loss functions from these three feature scales to jointly constrain a deep convolutional network. This effectively remedies the shortcomings of traditional common-feature extraction, where only a class loss on sample-level features is imposed, so that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an outsized influence on the extracted features. The invention introduces no extra parameters, adds almost no computation cost, extracts the common features of fine-grained data more accurately, and thereby improves the fine-grained cross-media retrieval effect.
Description
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to a fine-grained cross-media retrieval method based on multi-scale feature union.
Background
To realize high-quality, fine-grained information retrieval, fine-grained cross-media retrieval has become a research hotspot in the big-data era. Compared with traditional cross-media retrieval, fine-grained cross-media retrieval rests on the ability to extract accurate common features, and can provide users with more accurate and efficient multimedia retrieval services. Because fine-grained datasets exhibit small inter-class differences and large intra-class differences, directly using a traditional deep convolutional network to extract features from fine-grained samples often yields unsatisfactory experimental results.
The key to fine-grained feature extraction is locating and identifying local key regions, so that the target's detail features are accurately extracted and a better cross-media retrieval effect is obtained. In the common-feature extraction process of cross-media retrieval, what usually plays the major role is a small portion of the fine-grained data, its key regions, such as the head, wings or tail of a bird; the other, much larger regions tend to be only background noise or non-key regions of the object.
Traditional fine-grained recognition methods locate the key region through complex computation such as attention mechanisms, crop the key region out of the original data, and then input it into a deep network to extract fine-grained features. Such models tend to be highly complex and computationally expensive, and when the key region is located inaccurately the feature extraction result is severely degraded.
Disclosure of Invention
The invention aims to provide a fine-grained cross-media retrieval method based on multi-scale feature union, in which, on the basis of conventional sample-level features, target-level features and pixel-level features of key regions are additionally introduced, and three class loss functions that jointly constrain the deep convolutional network are constructed from the three feature scales; the method effectively remedies the shortcomings of traditional common-feature extraction, where only a class loss on sample-level features is imposed, so that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an outsized influence on the extracted features.
The invention is mainly realized by the following technical scheme:
a fine-grained cross-media retrieval method based on multi-scale feature union comprises the following steps:
step S100: acquiring a cross-media dataset containing image samples; processing each image sample with a deep convolutional neural network to obtain a group of N × H × W feature maps, where N is the number of channels and H and W are the height and width of each feature map;
step S200: inputting the feature maps of step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a fully connected layer, and computing the sample-level feature class loss;
step S300: accumulating the feature maps of step S100 along the channel dimension, retaining the largest connected component, and performing threshold binarization to remove background interference and retain the target key region, thereby obtaining target-level features and computing the target-level feature class loss;
step S400: setting a class label for each pixel of the feature maps of step S100 and computing and accumulating the class loss over all pixels, thereby locating the target key region more finely, obtaining pixel-level features, and computing the pixel-level feature class loss;
step S500: combining the sample-level, target-level and pixel-level feature class losses to jointly constrain the feature extraction network;
step S600: extracting media features with the feature extraction network, measuring the similarity between features of different media, and ranking them by similarity to realize retrieval.
In order to better implement the present invention, in step S100 a ResNet-50 network is used to extract a group of 2048 × 14 × 14 feature maps, recorded as $S = \{S_1, S_2, \ldots, S_N\}$ with $S_i \in \mathbb{R}^{14 \times 14}$, where i = 1, 2, …, N and N = 2048.
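By way of illustration only, a minimal sketch of step S100, assuming a PyTorch/torchvision implementation (the patent names ResNet-50 but no framework; the 448 × 448 input size follows the detailed description, and the class and variable names are hypothetical):

```python
# Sketch only: step S100 with a ResNet-50 backbone, assuming PyTorch/torchvision.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FeatureMapExtractor(nn.Module):
    """ResNet-50 truncated before global pooling, so it emits N x H x W feature maps."""
    def __init__(self):
        super().__init__()
        base = resnet50(weights=None)  # pretrained weights are optional here
        # Keep conv1 ... layer4 and drop the final avgpool and fc layers.
        self.body = nn.Sequential(*list(base.children())[:-2])

    def forward(self, x):
        return self.body(x)

backbone = FeatureMapExtractor()
imgs = torch.randn(2, 3, 448, 448)   # a batch of two image samples
S = backbone(imgs)                   # feature maps S
print(S.shape)                       # torch.Size([2, 2048, 14, 14])
```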
In order to better implement the present invention, further, the step S200 specifically includes the following steps:
The 2048-dimensional sample-level feature $a_k = \mathrm{GAP}(S) \in \mathbb{R}^{2048}$ is passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score $x_k$, which a Softmax function converts into the fine-grained class probability $p_k$; the sample-level feature class loss is then

$$L_{sam} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
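As a hedged illustration of step S200, under the same PyTorch assumption (2048 channels and C = 200 classes are taken from the text; `F.cross_entropy` fuses the Softmax and the cross-entropy l(p, y)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleLevelHead(nn.Module):
    """Step S200 sketch: GAP -> 2048 x 200 FC -> cross-entropy on sample-level features."""
    def __init__(self, channels=2048, num_classes=200):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, S, labels):
        a = S.mean(dim=(2, 3))                       # global average pooling: (B, 2048)
        scores = self.fc(a)                          # 200-dimensional class scores
        loss_sam = F.cross_entropy(scores, labels)   # Softmax + l(p, y) in one call
        return a, loss_sam

S = torch.randn(2, 2048, 14, 14)
a, loss_sam = SampleLevelHead()(S, torch.tensor([3, 17]))
```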
In order to better implement the present invention, further, the step S300 specifically includes the following steps:
first, the feature maps S are accumulated along the channel dimension to obtain the original activation map

$$A = \sum_{i=1}^{N} S_i$$

then only the largest connected component of the original activation map A is retained, giving the de-noised activation map $\tilde{A} = \mathrm{LCC}(A)$;

the de-noised activation map $\tilde{A}$ is then binarized against the response threshold a, set from its response mean $\bar{A}$, to obtain the target mask

$$M_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} > a \\ 0, & \text{otherwise} \end{cases}$$

wherein:

a is the response threshold,

finally, the feature maps S and the target mask M are multiplied element-wise and input into the global average pooling layer to obtain the target-level feature

$$a_k^{tar} = \mathrm{GAP}(S \odot M)$$

Substituting $a_k^{tar}$ into the fully connected layer, Softmax and cross-entropy pipeline of step S200 yields the target-level feature class loss

$$L_{tar} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

p is the fine-grained class probability,

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
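By way of a non-authoritative sketch of step S300 (PyTorch plus scipy's connected-component labeling standing in for the LCC operation; the response threshold a is taken to be the mean of the activation map, one plausible reading of the text; thresholding is performed before component labeling, since extracting a connected component presupposes a binary map):

```python
import torch
import numpy as np
from scipy import ndimage

def target_mask_and_feature(S):
    """Step S300 sketch: activation map -> mean threshold -> largest connected
    component -> target mask -> masked global average pooling.

    S: (B, N, H, W) feature maps. Returns masks (B, H, W) and features (B, N).
    """
    masks, feats = [], []
    for s in S:
        A = s.sum(dim=0)                         # original activation map A, (H, W)
        fg = (A > A.mean()).cpu().numpy()        # binarize at the response mean
        lab, n = ndimage.label(fg)               # label connected components
        if n > 0:
            sizes = ndimage.sum(fg, lab, index=range(1, n + 1))
            M = torch.from_numpy(lab == (int(np.argmax(sizes)) + 1))
        else:
            M = torch.ones(A.shape, dtype=torch.bool)   # degenerate case: keep all
        M = M.to(s.device, s.dtype)
        masks.append(M)
        feats.append((s * M).mean(dim=(1, 2)))   # GAP over the masked feature maps
    return torch.stack(masks), torch.stack(feats)
```

Feeding the returned features through the same fully connected layer and cross-entropy as the sample-level branch then gives the target-level class loss.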
In order to better implement the present invention, further, the step S400 specifically includes the following steps:
first, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label must be generated for each location (i, j) in the feature map S:

$$y_{i,j} = \begin{cases} c_k, & M_{i,j} = 1 \\ C + 1, & M_{i,j} = 0 \end{cases}$$

wherein:

k is the sample number and $c_k$ is the fine-grained class of sample k,

C is the total number of categories, and C + 1 denotes the auxiliary background class;

the numerical representation of each pixel class, from 1 to C + 1, is converted into a one-hot vector representation y:

$$y_m = \begin{cases} 1, & m = y_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

wherein m represents the m-th of the C + 1 classes;

the feature map S is input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, giving the (C+1) × H × W pixel-level features $\hat{S}$;

the class prediction score of each pixel of $\hat{S}$ is then converted into a class probability by a Softmax function:

$$p_{i,j} = \mathrm{Softmax}(\hat{S}_{:,i,j})$$

at this point, the fine-grained class loss of each pixel is

$$l_{i,j} = l(p_{i,j}, y_{i,j})$$

the fine-grained class losses of the target pixels and of the background pixels are accumulated separately:

$$L_{obj} = \sum_{(i,j):\, M_{i,j} = 1} l_{i,j}, \qquad L_{bg} = \sum_{(i,j):\, M_{i,j} = 0} l_{i,j}$$

the final pixel-level feature class loss $L_{pix}$ is obtained by linearly combining $L_{obj}$ and $L_{bg}$ according to the pixel proportions:

$$L_{pix} = \frac{n_{bg}}{HW} L_{obj} + \frac{n_{obj}}{HW} L_{bg}$$

wherein $n_{obj}$ and $n_{bg}$ are the numbers of target and background pixels in the mask M.
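Under the same assumptions, a sketch of step S400 (the auxiliary background class occupies the extra output channel, here index C; the proportion weighting mirrors the linear combination above; all names hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelLevelHead(nn.Module):
    """Step S400 sketch: 1 x 1 conv to C+1 channels, per-pixel cross-entropy,
    then target/background losses combined by pixel proportion."""
    def __init__(self, channels=2048, num_classes=200):
        super().__init__()
        self.conv = nn.Conv2d(channels, num_classes + 1, kernel_size=1)
        self.bg = num_classes                    # index of the auxiliary background class

    def forward(self, S, M, labels):
        # S: (B, N, H, W); M: (B, H, W) float {0,1} target masks; labels: (B,).
        scores = self.conv(S)                    # pixel-level features, (B, C+1, H, W)
        # Pixel-level auxiliary labels: sample class inside the mask, background outside.
        y = torch.where(M.bool(),
                        labels.view(-1, 1, 1).expand_as(M),
                        torch.full_like(M, self.bg).long())
        l = F.cross_entropy(scores, y, reduction='none')   # per-pixel loss, (B, H, W)
        n_all = M.shape[1] * M.shape[2]                    # H * W
        n_obj = M.sum(dim=(1, 2))                          # target pixels per sample
        L_obj = (l * M).sum(dim=(1, 2))                    # accumulated target loss
        L_bg = (l * (1 - M)).sum(dim=(1, 2))               # accumulated background loss
        # Smaller targets put more weight on the target loss, as in the formula above.
        L_pix = ((n_all - n_obj) / n_all) * L_obj + (n_obj / n_all) * L_bg
        return L_pix.mean()
```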
In order to better implement the present invention, further, the loss function of the feature extraction network in step S500 is the sum of the sample-level feature class loss, the target-level feature class loss and the pixel-level feature class loss:

$$L = L_{sam} + L_{tar} + L_{pix}$$
In order to better implement the present invention, further, the sample-level feature class loss, the target-level feature class loss, and the pixel-level feature class loss are all obtained based on a feature map, and all use a cross entropy loss function to constrain class probabilities.
The invention has the beneficial effects that:
On the basis of conventional sample-level features, the invention additionally introduces target-level features and pixel-level features of key regions, and constructs from these three feature scales a training process in which three class loss functions jointly constrain the deep convolutional network. This effectively remedies the shortcomings of traditional common-feature extraction, where only a class loss on sample-level features is imposed, so that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an outsized influence on the extracted features. The invention introduces no extra parameters, adds almost no computation cost, extracts the common features of fine-grained data more accurately, and thereby improves the fine-grained cross-media retrieval effect.
Drawings
FIG. 1 is a functional block diagram of the present invention;
FIG. 2 is a schematic diagram of Feature Maps activation regions;
FIG. 3 is a schematic view of a target location process;
FIG. 4 is a comparison graph of target positioning accuracy.
Detailed Description
Example 1:
a fine-grained cross-media retrieval method based on multi-scale feature union is shown in FIG. 1, and comprises the following steps:
step S100: acquiring a cross-media dataset containing image samples; processing each image sample with a deep convolutional neural network to obtain a group of N × H × W feature maps, where N is the number of channels and H and W are the height and width of each feature map;
step S200: inputting the feature maps of step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a fully connected layer, and computing the sample-level feature class loss;
step S300: accumulating the feature maps of step S100 along the channel dimension, retaining the largest connected component, and performing threshold binarization to remove background interference and retain the target key region, thereby obtaining target-level features and computing the target-level feature class loss;
step S400: setting a class label for each pixel of the feature maps of step S100 and computing and accumulating the class loss over all pixels, thereby locating the target key region more finely, obtaining pixel-level features, and computing the pixel-level feature class loss;
step S500: combining the sample-level, target-level and pixel-level feature class losses to jointly constrain the feature extraction network;
step S600: extracting media features with the feature extraction network, measuring the similarity between features of different media, and ranking them by similarity to realize retrieval (a similarity-ranking sketch follows this list).
Further, in the step S100, a ResNet-50 network is adopted to extract a group of 2048 × 14 × 14 feature maps, recorded as $S = \{S_1, S_2, \ldots, S_N\}$ with $S_i \in \mathbb{R}^{14 \times 14}$, where i = 1, 2, …, N and N = 2048.
Example 2:
in this embodiment, optimization is performed on the basis of embodiment 1, and the step S200 specifically includes the following steps:
The 2048-dimensional sample-level feature $a_k = \mathrm{GAP}(S) \in \mathbb{R}^{2048}$ is passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score $x_k$, which a Softmax function converts into the fine-grained class probability $p_k$; the sample-level feature class loss is then

$$L_{sam} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
in this embodiment, optimization is performed on the basis of embodiment 1 or 2, and the step S300 specifically includes the following steps:
first, the feature maps S are accumulated along the channel dimension to obtain the original activation map

$$A = \sum_{i=1}^{N} S_i$$

then only the largest connected component of the original activation map A is retained, giving the de-noised activation map $\tilde{A} = \mathrm{LCC}(A)$;

the de-noised activation map $\tilde{A}$ is then binarized against the response threshold a, set from its response mean $\bar{A}$, to obtain the target mask

$$M_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} > a \\ 0, & \text{otherwise} \end{cases}$$

wherein:

a is the response threshold,

finally, the feature maps S and the target mask M are multiplied element-wise and input into the global average pooling layer to obtain the target-level feature

$$a_k^{tar} = \mathrm{GAP}(S \odot M)$$

Substituting $a_k^{tar}$ into the fully connected layer, Softmax and cross-entropy pipeline of step S200 yields the target-level feature class loss

$$L_{tar} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

p is the fine-grained class probability,

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
Further, the step S400 specifically includes the following steps:
first, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label must be generated for each location (i, j) in the feature map S:

$$y_{i,j} = \begin{cases} c_k, & M_{i,j} = 1 \\ C + 1, & M_{i,j} = 0 \end{cases}$$

wherein:

k is the sample number and $c_k$ is the fine-grained class of sample k,

C is the total number of categories, and C + 1 denotes the auxiliary background class;

the numerical representation of each pixel class, from 1 to C + 1, is converted into a one-hot vector representation y:

$$y_m = \begin{cases} 1, & m = y_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

wherein m represents the m-th of the C + 1 classes;

the feature map S is input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, giving the (C+1) × H × W pixel-level features $\hat{S}$;

the class prediction score of each pixel of $\hat{S}$ is then converted into a class probability by a Softmax function:

$$p_{i,j} = \mathrm{Softmax}(\hat{S}_{:,i,j})$$

at this point, the fine-grained class loss of each pixel is

$$l_{i,j} = l(p_{i,j}, y_{i,j})$$

the fine-grained class losses of the target pixels and of the background pixels are accumulated separately:

$$L_{obj} = \sum_{(i,j):\, M_{i,j} = 1} l_{i,j}, \qquad L_{bg} = \sum_{(i,j):\, M_{i,j} = 0} l_{i,j}$$

the final pixel-level feature class loss $L_{pix}$ is obtained by linearly combining $L_{obj}$ and $L_{bg}$ according to the pixel proportions:

$$L_{pix} = \frac{n_{bg}}{HW} L_{obj} + \frac{n_{obj}}{HW} L_{bg}$$

wherein $n_{obj}$ and $n_{bg}$ are the numbers of target and background pixels in the mask M.
the rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
A fine-grained cross-media retrieval method based on multi-scale feature union expands the single sample-level feature scale into three scales, sample level, target level and pixel level, each constrained by a fine-grained class loss. First, the shortcomings of traditional sample-level features are analyzed, and key-region localization is chosen as the way to obtain more accurate fine-grained features. Then, target-level features are introduced: background interference is removed and the target key region retained by accumulating the feature maps, keeping the largest connected component and binarizing at a mean-based threshold. Next, pixel-level features are introduced: a class label is set for each pixel of the feature map, and the class losses of all pixels are computed and accumulated to locate the target key region more finely. Finally, the class loss functions of the three feature scales are combined to constrain the feature extraction network.
Further, as shown in FIG. 1, taking the image media type as an example, the steps and flow of the MSFG (Multi-Scale Fine-Grained) method are presented, where:
LCC (Largest Connected Component) denotes retaining the largest connected component,
AVG Pool denotes the global average pooling layer,
FC denotes a fully connected layer, Σ denotes accumulation,
Conv denotes a convolutional layer,
N denotes the number of channels of the feature maps, H and W are the height and width of each feature map, and C denotes the total number of categories.
As shown in FIG. 1, an image sample first passes through a conventional deep convolutional neural network, ResNet-50, to obtain a set of N × H × W feature maps. Three feature-map processing flows then produce the feature class losses at the three scales; Sample Loss, Target Loss and Pixel Loss in the figure correspond to the sample-level, target-level and pixel-level feature class losses, respectively. Sample-level features retain global information such as context, spatial position and the surrounding background, which contributes to feature extraction for every media type, especially text and audio samples. Meanwhile, for a fine-grained dataset this global information often carries distinctive semantic category cues, and such cues may differ morphologically within the same fine-grained category. The class loss on sample-level features is therefore irreplaceable for learning fine-grained features.
Further, the sample-level feature class loss is calculated as follows:
the 2048 × 14 × 14 feature maps extracted by ResNet-50 are recorded as $S = \{S_i\}$ (i = 1, 2, …, N); inputting S into the global average pooling layer yields the sample-level feature

$$a_k^I = \mathrm{GAP}(S) \in \mathbb{R}^{2048}$$

The 2048-dimensional sample-level feature $a_k^I$ is then passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score

$$x_k^I = W^{\top} a_k^I$$

$x_k^I$ is converted into the fine-grained class probability p by a Softmax function:

$$p = \mathrm{Softmax}(x_k^I)$$

Finally, the sample-level fine-grained class loss is constructed with the sample class label y:

$$L_{sam} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein I, T, V, A respectively represent the image, text, audio and video media types, and l(p, y) is the cross-entropy loss function:

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$
further, in order to reduce the computational efficiency and the model complexity, the feature-based graph S of the three-scale class loss is obtained, and the class probabilities are all constrained by using a cross-entropy loss function, so that the following target-level lossAnd pixel level lossThe only difference from the sample-level penalty is the computational process from the feature map S to X.
Further, the target-level feature class loss is calculated as follows:
In the field of fine-grained cross-media retrieval, locating the target region and its key parts is an effective way to extract fine-grained features. Inspired by the SCDA algorithm, the convolutional feature maps before the global pooling layer of a deep convolutional network carry, on each channel, activation responses to different local positions.
As shown in FIG. 2, which displays the feature maps of 4 randomly selected channels: in the first row, channels 108 and 468 activate the legs of the hummingbird, channel 375 activates its head, and channel 284 activates no region at all; in the second row, channels 468, 375 and 284 activate the hummingbird's head, feet and body respectively, while channel 108 activates background noise. Experiments verify that few channels activate background noise, and a large number of activation responses concentrate on the target position and its key regions. As shown in FIG. 3, accumulating the feature maps along the channel dimension and then retaining the largest connected component effectively locates the target position and eliminates background-noise interference; binarization then yields the target mask.
First, the feature maps $S_i$ (i = 1, 2, …, N) are accumulated along the channel dimension to obtain the original activation map A:

$$A = \sum_{i=1}^{N} S_i$$

Then only the largest connected component of the original activation map A is retained, giving the de-noised activation map $\tilde{A} = \mathrm{LCC}(A)$.

The de-noised activation map $\tilde{A}$ is then binarized against the response threshold a, set from its response mean $\bar{A}$, to obtain the target mask:

$$M_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} > a \\ 0, & \text{otherwise} \end{cases}$$

Finally, the feature maps S and the target mask M are multiplied element-wise and input into the global average pooling layer to obtain the target-level feature

$$a_k^{tar} = \mathrm{GAP}(S \odot M)$$

Substituting $a_k^{tar}$ into the classification pipeline above yields the target-level fine-grained class loss $L_{tar}$.
In contrast to sample-level features, target-level features focus only on the target region and its key parts. This automatically eliminates the influence of background noise and further amplifies the fine-grained characteristics of the target, making the prediction of the fine-grained class probability depend entirely on the key information provided by the target region, which effectively improves the accuracy of fine-grained class prediction.
Further, the pixel-level feature class loss is calculated as follows:
The pixel-level features set forth in this disclosure refer to pixels of the feature map, not pixels of the original 448 × 448 × 3 sample matrix. As noted above, locating the target position by channel-wise accumulation of the feature maps carries some error and ambiguity: the resulting target mask often does not fit the target edges tightly and retains some background. As shown in FIG. 4, the target localization in FIG. 4(b) is more accurate than in FIG. 4(a) and can focus on extracting more effective fine-grained features.
In order to realize more accurate target localization, pixel-level features are further introduced on the basis of the target-level features to confirm the target region at a finer scale. First, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label must be generated for each location (i, j) in the feature map:

$$y_{i,j} = \begin{cases} c_k, & M_{i,j} = 1 \\ C + 1, & M_{i,j} = 0 \end{cases}$$

The numerical representation of each pixel class, from 1 to C + 1, is converted into a one-hot vector representation y:

$$y_m = \begin{cases} 1, & m = y_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

The feature maps $S_i$ (i = 1, 2, …, N) are input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 (namely 201) output channels, giving the (C+1) × H × W pixel-level features $\hat{S}$.

The class prediction score of each pixel of $\hat{S}$ is then converted into a class probability by a Softmax function:

$$p_{i,j} = \mathrm{Softmax}(\hat{S}_{:,i,j})$$

At this point, the fine-grained class loss of each pixel is calculated as

$$l_{i,j} = l(p_{i,j}, y_{i,j})$$

The fine-grained class losses of the target pixels and of the background pixels are accumulated separately:

$$L_{obj} = \sum_{(i,j):\, M_{i,j} = 1} l_{i,j}, \qquad L_{bg} = \sum_{(i,j):\, M_{i,j} = 0} l_{i,j}$$

The final pixel-level fine-grained class loss $L_{pix}$ is obtained by linearly combining $L_{obj}$ and $L_{bg}$ according to the pixel proportions:

$$L_{pix} = \frac{n_{bg}}{HW} L_{obj} + \frac{n_{obj}}{HW} L_{bg}$$
In general, the difficulty of class prediction for a fine-grained sample is inversely proportional to the size of the target: the larger the target's share of the image, the higher the probability of an accurate class prediction. The target loss $L_{obj}$ and background loss $L_{bg}$ are therefore weighted so that the smaller the target, i.e. the higher the proportion of background pixels, the larger the weight given to the target loss, which biases the network parameters toward feature learning of the target region. Finally, the total loss function of the network is the sum of the class losses of the sample-level, target-level and pixel-level features:

$$L = L_{sam} + L_{tar} + L_{pix}$$
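Tying the three scales together, one joint training step might look like the following sketch; FeatureMapExtractor, SampleLevelHead, target_mask_and_feature and PixelLevelHead refer to the hypothetical sketches above, and reusing the SampleLevelHead shape for the target branch is an assumption, since the text only states that the three losses share the same cross-entropy form:

```python
import torch
import torch.nn.functional as F

# Hypothetical joint training step for L = L_sam + L_tar + L_pix (step S500).
backbone = FeatureMapExtractor()
sample_head = SampleLevelHead()
target_head = SampleLevelHead()   # same FC shape, fed target-level features instead
pixel_head = PixelLevelHead()
params = (list(backbone.parameters()) + list(sample_head.parameters())
          + list(target_head.parameters()) + list(pixel_head.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)

def train_step(imgs, labels):
    S = backbone(imgs)                            # step S100: feature maps
    _, loss_sam = sample_head(S, labels)          # step S200: sample-level loss
    M, a_tar = target_mask_and_feature(S)         # step S300: mask + target feature
    loss_tar = F.cross_entropy(target_head.fc(a_tar), labels)
    loss_pix = pixel_head(S, M, labels)           # step S400: pixel-level loss
    loss = loss_sam + loss_tar + loss_pix         # step S500: joint constraint
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(2, 3, 448, 448), torch.tensor([3, 17]))
```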
On the basis of conventional sample-level features, the invention additionally introduces target-level features and pixel-level features of key regions, and constructs from these three feature scales a training process in which three class loss functions jointly constrain the deep convolutional network. This effectively remedies the shortcomings of traditional common-feature extraction, where only a class loss on sample-level features is imposed, so that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an outsized influence on the extracted features. The invention introduces no extra parameters, adds almost no computation cost, extracts the common features of fine-grained data more accurately, and thereby improves the fine-grained cross-media retrieval effect.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.
Claims (7)
1. A fine-grained cross-media retrieval method based on multi-scale feature union is characterized by comprising the following steps:
step S100: acquiring a cross-media dataset containing image samples; processing each image sample with a deep convolutional neural network to obtain a group of N × H × W feature maps, where N is the number of channels and H and W are the height and width of each feature map;
step S200: inputting the feature maps of step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a fully connected layer, and computing the sample-level feature class loss;
step S300: accumulating the feature maps of step S100 along the channel dimension, retaining the largest connected component, and performing threshold binarization to remove background interference and retain the target key region, thereby obtaining target-level features and computing the target-level feature class loss;
step S400: setting a class label for each pixel of the feature maps of step S100 and computing and accumulating the class loss over all pixels, thereby locating the target key region more finely, obtaining pixel-level features, and computing the pixel-level feature class loss;
step S500: combining the sample-level, target-level and pixel-level feature class losses to jointly constrain the feature extraction network;
step S600: extracting media features with the feature extraction network, measuring the similarity between features of different media, and ranking them by similarity to realize retrieval.
3. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 2, wherein the step S200 specifically comprises the following steps:
The 2048-dimensional sample-level feature $a_k = \mathrm{GAP}(S) \in \mathbb{R}^{2048}$ is passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score $x_k$, which a Softmax function converts into the fine-grained class probability $p_k$; the sample-level feature class loss is then

$$L_{sam} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
4. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 2, wherein the step S300 specifically comprises the following steps:
first, the feature maps S are accumulated along the channel dimension to obtain the original activation map

$$A = \sum_{i=1}^{N} S_i$$

then only the largest connected component of the original activation map A is retained, giving the de-noised activation map $\tilde{A} = \mathrm{LCC}(A)$;

the de-noised activation map $\tilde{A}$ is then binarized against the response threshold a, set from its response mean $\bar{A}$, to obtain the target mask

$$M_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} > a \\ 0, & \text{otherwise} \end{cases}$$

wherein:

a is the response threshold,

finally, the feature maps S and the target mask M are multiplied element-wise and input into the global average pooling layer to obtain the target-level feature

$$a_k^{tar} = \mathrm{GAP}(S \odot M)$$

Substituting $a_k^{tar}$ into the fully connected layer, Softmax and cross-entropy pipeline of step S200 yields the target-level feature class loss

$$L_{tar} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

p is the fine-grained class probability,

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
5. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 4, wherein the step S400 specifically comprises the following steps:
first, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label must be generated for each location (i, j) in the feature map S:

$$y_{i,j} = \begin{cases} c_k, & M_{i,j} = 1 \\ C + 1, & M_{i,j} = 0 \end{cases}$$

wherein:

k is the sample number and $c_k$ is the fine-grained class of sample k,

C is the total number of categories, and C + 1 denotes the auxiliary background class;

the numerical representation of each pixel class, from 1 to C + 1, is converted into a one-hot vector representation y:

$$y_m = \begin{cases} 1, & m = y_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

wherein m represents the m-th of the C + 1 classes;

the feature map S is input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, giving the (C+1) × H × W pixel-level features $\hat{S}$;

the class prediction score of each pixel of $\hat{S}$ is then converted into a class probability by a Softmax function:

$$p_{i,j} = \mathrm{Softmax}(\hat{S}_{:,i,j})$$

at this point, the fine-grained class loss of each pixel is

$$l_{i,j} = l(p_{i,j}, y_{i,j})$$

the fine-grained class losses of the target pixels and of the background pixels are accumulated separately:

$$L_{obj} = \sum_{(i,j):\, M_{i,j} = 1} l_{i,j}, \qquad L_{bg} = \sum_{(i,j):\, M_{i,j} = 0} l_{i,j}$$

the final pixel-level feature class loss $L_{pix}$ is obtained by linearly combining $L_{obj}$ and $L_{bg}$ according to the pixel proportions:

$$L_{pix} = \frac{n_{bg}}{HW} L_{obj} + \frac{n_{obj}}{HW} L_{bg}$$
6. The fine-grained cross-media retrieval method based on multi-scale feature union of claim 1, wherein the loss function of the feature extraction network in step S500 is the sum of the sample-level feature class loss, the target-level feature class loss and the pixel-level feature class loss:

$$L = L_{sam} + L_{tar} + L_{pix}$$
7. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 6, wherein the sample-level feature class loss, the target-level feature class loss and the pixel-level feature class loss are all obtained based on a feature map, and all use a cross entropy loss function to constrain class probability.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111258804.8A | 2021-10-28 | 2021-10-28 | Fine-grained cross-media retrieval method based on multi-scale feature union
Publications (2)

Publication Number | Publication Date
---|---
CN113704537A | 2021-11-26
CN113704537B | 2022-02-15
Family
ID=78647132

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202111258804.8A (granted as CN113704537B, Active) | Fine-grained cross-media retrieval method based on multi-scale feature union | 2021-10-28 | 2021-10-28

Country Status (1)

Country | Link
---|---
CN | CN113704537B
Also Published As

Publication Number | Publication Date
---|---
CN113704537B | 2022-02-15
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |