CN113704537A - Fine-grained cross-media retrieval method based on multi-scale feature union - Google Patents
Fine-grained cross-media retrieval method based on multi-scale feature union
- Publication number
- CN113704537A (application CN202111258804.8A)
- Authority
- CN
- China
- Prior art keywords
- sample
- class
- fine-grained
- level
- Prior art date
- 2021-10-28
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/583 — Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
- G06F16/5866 — Information retrieval of still image data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments
- G06F16/783 — Information retrieval of video data; retrieval characterised by using metadata automatically derived from the content
- G06F16/7867 — Information retrieval of video data; retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments, title and artist information
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a fine-grained cross-media retrieval method based on multi-scale feature union. On the basis of conventional sample-level features, the method additionally introduces target-level features and pixel-level features of key regions, and constructs three class loss functions from these three feature scales to jointly constrain a deep convolutional network. This effectively remedies the shortcomings of traditional common-feature extraction, where only a class loss on sample-level features is imposed, so that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an outsized influence on the extracted features. The invention introduces no extra parameters, adds almost no computation cost, extracts the common features of fine-grained data more accurately, and thereby improves the fine-grained cross-media retrieval effect.
Description
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to a fine-grained cross-media retrieval method based on multi-scale feature union.
Background
To realize high-quality, fine-grained information retrieval, fine-grained cross-media retrieval has become a research hotspot in the big-data era. Compared with traditional cross-media retrieval, fine-grained cross-media retrieval rests on the ability to extract accurate common features, and can provide users with more accurate and efficient multimedia retrieval services. Because fine-grained datasets exhibit small inter-class differences and large intra-class differences, directly using a traditional deep convolutional network to extract features from fine-grained samples often yields unsatisfactory experimental results.
The key to fine-grained feature extraction is locating and identifying local key regions, so that the target's detail features are accurately extracted and a better cross-media retrieval effect is obtained. In the common-feature extraction process of cross-media retrieval, what usually plays the major role is a small portion of the fine-grained data, its key regions, such as the head, wings or tail of a bird; the other, much larger regions tend to be only background noise or non-key regions of the object.
Traditional fine-grained recognition methods locate the key region through complex computation such as attention mechanisms, crop the key region out of the original data, and then input it into a deep network to extract fine-grained features. Such models tend to be highly complex and computationally expensive, and when the key region is located inaccurately the feature extraction result is severely degraded.
Disclosure of Invention
The invention aims to provide a fine-grained cross-media retrieval method based on multi-scale feature union, in which, on the basis of conventional sample-level features, target-level features and pixel-level features of key regions are additionally introduced, and three class loss functions that jointly constrain the deep convolutional network are constructed from the three feature scales; the method effectively remedies the shortcomings of traditional common-feature extraction, where only a class loss on sample-level features is imposed, so that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an outsized influence on the extracted features.
The invention is mainly realized by the following technical scheme:
a fine-grained cross-media retrieval method based on multi-scale feature union comprises the following steps:
step S100: acquiring a cross-media dataset containing image samples; processing each image sample with a deep convolutional neural network to obtain a group of N × H × W feature maps, where N is the number of channels and H and W are the height and width of each feature map;
step S200: inputting the feature maps of step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a fully connected layer, and computing the sample-level feature class loss;
step S300: accumulating the feature maps of step S100 along the channel dimension, retaining the largest connected component, and performing threshold binarization to remove background interference and retain the target key region, thereby obtaining target-level features and computing the target-level feature class loss;
step S400: setting a class label for each pixel of the feature maps of step S100 and computing and accumulating the class loss over all pixels, thereby locating the target key region more finely, obtaining pixel-level features, and computing the pixel-level feature class loss;
step S500: combining the sample-level, target-level and pixel-level feature class losses to jointly constrain the feature extraction network;
step S600: extracting media features with the feature extraction network, measuring the similarity between features of different media, and ranking them by similarity to realize retrieval.
In order to better implement the present invention, in step S100 a ResNet-50 network is used to extract a group of 2048 × 14 × 14 feature maps, recorded as $S = \{S_1, S_2, \ldots, S_N\}$ with $S_i \in \mathbb{R}^{14 \times 14}$, where i = 1, 2, …, N and N = 2048.
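By way of illustration only, a minimal sketch of step S100, assuming a PyTorch/torchvision implementation (the patent names ResNet-50 but no framework; the 448 × 448 input size follows the detailed description, and the class and variable names are hypothetical):

```python
# Sketch only: step S100 with a ResNet-50 backbone, assuming PyTorch/torchvision.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FeatureMapExtractor(nn.Module):
    """ResNet-50 truncated before global pooling, so it emits N x H x W feature maps."""
    def __init__(self):
        super().__init__()
        base = resnet50(weights=None)  # pretrained weights are optional here
        # Keep conv1 ... layer4 and drop the final avgpool and fc layers.
        self.body = nn.Sequential(*list(base.children())[:-2])

    def forward(self, x):
        return self.body(x)

backbone = FeatureMapExtractor()
imgs = torch.randn(2, 3, 448, 448)   # a batch of two image samples
S = backbone(imgs)                   # feature maps S
print(S.shape)                       # torch.Size([2, 2048, 14, 14])
```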
In order to better implement the present invention, further, the step S200 specifically includes the following steps:
The 2048-dimensional sample-level feature $a_k = \mathrm{GAP}(S) \in \mathbb{R}^{2048}$ is passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score $x_k$, which a Softmax function converts into the fine-grained class probability $p_k$; the sample-level feature class loss is then

$$L_{sam} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
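As a hedged illustration of step S200, under the same PyTorch assumption (2048 channels and C = 200 classes are taken from the text; `F.cross_entropy` fuses the Softmax and the cross-entropy l(p, y)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleLevelHead(nn.Module):
    """Step S200 sketch: GAP -> 2048 x 200 FC -> cross-entropy on sample-level features."""
    def __init__(self, channels=2048, num_classes=200):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, S, labels):
        a = S.mean(dim=(2, 3))                       # global average pooling: (B, 2048)
        scores = self.fc(a)                          # 200-dimensional class scores
        loss_sam = F.cross_entropy(scores, labels)   # Softmax + l(p, y) in one call
        return a, loss_sam

S = torch.randn(2, 2048, 14, 14)
a, loss_sam = SampleLevelHead()(S, torch.tensor([3, 17]))
```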
In order to better implement the present invention, further, the step S300 specifically includes the following steps:
first, the feature maps S are accumulated along the channel dimension to obtain the original activation map

$$A = \sum_{i=1}^{N} S_i$$

then only the largest connected component of the original activation map A is retained, giving the de-noised activation map $\tilde{A} = \mathrm{LCC}(A)$;

the de-noised activation map $\tilde{A}$ is then binarized against the response threshold a, set from its response mean $\bar{A}$, to obtain the target mask

$$M_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} > a \\ 0, & \text{otherwise} \end{cases}$$

wherein:

a is the response threshold,

finally, the feature maps S and the target mask M are multiplied element-wise and input into the global average pooling layer to obtain the target-level feature

$$a_k^{tar} = \mathrm{GAP}(S \odot M)$$

Substituting $a_k^{tar}$ into the fully connected layer, Softmax and cross-entropy pipeline of step S200 yields the target-level feature class loss

$$L_{tar} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

p is the fine-grained class probability,

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
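By way of a non-authoritative sketch of step S300 (PyTorch plus scipy's connected-component labeling standing in for the LCC operation; the response threshold a is taken to be the mean of the activation map, one plausible reading of the text; thresholding is performed before component labeling, since extracting a connected component presupposes a binary map):

```python
import torch
import numpy as np
from scipy import ndimage

def target_mask_and_feature(S):
    """Step S300 sketch: activation map -> mean threshold -> largest connected
    component -> target mask -> masked global average pooling.

    S: (B, N, H, W) feature maps. Returns masks (B, H, W) and features (B, N).
    """
    masks, feats = [], []
    for s in S:
        A = s.sum(dim=0)                         # original activation map A, (H, W)
        fg = (A > A.mean()).cpu().numpy()        # binarize at the response mean
        lab, n = ndimage.label(fg)               # label connected components
        if n > 0:
            sizes = ndimage.sum(fg, lab, index=range(1, n + 1))
            M = torch.from_numpy(lab == (int(np.argmax(sizes)) + 1))
        else:
            M = torch.ones(A.shape, dtype=torch.bool)   # degenerate case: keep all
        M = M.to(s.device, s.dtype)
        masks.append(M)
        feats.append((s * M).mean(dim=(1, 2)))   # GAP over the masked feature maps
    return torch.stack(masks), torch.stack(feats)
```

Feeding the returned features through the same fully connected layer and cross-entropy as the sample-level branch then gives the target-level class loss.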
In order to better implement the present invention, further, the step S400 specifically includes the following steps:
first, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label must be generated for each location (i, j) in the feature map S:

$$y_{i,j} = \begin{cases} c_k, & M_{i,j} = 1 \\ C + 1, & M_{i,j} = 0 \end{cases}$$

wherein:

k is the sample number and $c_k$ is the fine-grained class of sample k,

C is the total number of categories, and C + 1 denotes the auxiliary background class;

the numerical representation of each pixel class, from 1 to C + 1, is converted into a one-hot vector representation y:

$$y_m = \begin{cases} 1, & m = y_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

wherein m represents the m-th of the C + 1 classes;

the feature map S is input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, giving the (C+1) × H × W pixel-level features $\hat{S}$;

the class prediction score of each pixel of $\hat{S}$ is then converted into a class probability by a Softmax function:

$$p_{i,j} = \mathrm{Softmax}(\hat{S}_{:,i,j})$$

at this point, the fine-grained class loss of each pixel is

$$l_{i,j} = l(p_{i,j}, y_{i,j})$$

the fine-grained class losses of the target pixels and of the background pixels are accumulated separately:

$$L_{obj} = \sum_{(i,j):\, M_{i,j} = 1} l_{i,j}, \qquad L_{bg} = \sum_{(i,j):\, M_{i,j} = 0} l_{i,j}$$

the final pixel-level feature class loss $L_{pix}$ is obtained by linearly combining $L_{obj}$ and $L_{bg}$ according to the pixel proportions:

$$L_{pix} = \frac{n_{bg}}{HW} L_{obj} + \frac{n_{obj}}{HW} L_{bg}$$

wherein $n_{obj}$ and $n_{bg}$ are the numbers of target and background pixels in the mask M.
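Under the same assumptions, a sketch of step S400 (the auxiliary background class occupies the extra output channel, here index C; the proportion weighting mirrors the linear combination above; all names hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelLevelHead(nn.Module):
    """Step S400 sketch: 1 x 1 conv to C+1 channels, per-pixel cross-entropy,
    then target/background losses combined by pixel proportion."""
    def __init__(self, channels=2048, num_classes=200):
        super().__init__()
        self.conv = nn.Conv2d(channels, num_classes + 1, kernel_size=1)
        self.bg = num_classes                    # index of the auxiliary background class

    def forward(self, S, M, labels):
        # S: (B, N, H, W); M: (B, H, W) float {0,1} target masks; labels: (B,).
        scores = self.conv(S)                    # pixel-level features, (B, C+1, H, W)
        # Pixel-level auxiliary labels: sample class inside the mask, background outside.
        y = torch.where(M.bool(),
                        labels.view(-1, 1, 1).expand_as(M),
                        torch.full_like(M, self.bg).long())
        l = F.cross_entropy(scores, y, reduction='none')   # per-pixel loss, (B, H, W)
        n_all = M.shape[1] * M.shape[2]                    # H * W
        n_obj = M.sum(dim=(1, 2))                          # target pixels per sample
        L_obj = (l * M).sum(dim=(1, 2))                    # accumulated target loss
        L_bg = (l * (1 - M)).sum(dim=(1, 2))               # accumulated background loss
        # Smaller targets put more weight on the target loss, as in the formula above.
        L_pix = ((n_all - n_obj) / n_all) * L_obj + (n_obj / n_all) * L_bg
        return L_pix.mean()
```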
In order to better implement the present invention, further, the loss function of the feature extraction network in step S500 is the sum of the sample-level feature class loss, the target-level feature class loss and the pixel-level feature class loss:

$$L = L_{sam} + L_{tar} + L_{pix}$$
In order to better implement the present invention, further, the sample-level feature class loss, the target-level feature class loss, and the pixel-level feature class loss are all obtained based on a feature map, and all use a cross entropy loss function to constrain class probabilities.
The invention has the beneficial effects that:
On the basis of conventional sample-level features, the invention additionally introduces target-level features and pixel-level features of key regions, and constructs from these three feature scales a training process in which three class loss functions jointly constrain the deep convolutional network. This effectively remedies the shortcomings of traditional common-feature extraction, where only a class loss on sample-level features is imposed, so that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an outsized influence on the extracted features. The invention introduces no extra parameters, adds almost no computation cost, extracts the common features of fine-grained data more accurately, and thereby improves the fine-grained cross-media retrieval effect.
Drawings
FIG. 1 is a functional block diagram of the present invention;
FIG. 2 is a schematic diagram of Feature Maps activation regions;
FIG. 3 is a schematic view of a target location process;
FIG. 4 is a comparison graph of target positioning accuracy.
Detailed Description
Example 1:
a fine-grained cross-media retrieval method based on multi-scale feature union is shown in FIG. 1, and comprises the following steps:
step S100: acquiring a cross-media dataset containing image samples; processing each image sample with a deep convolutional neural network to obtain a group of N × H × W feature maps, where N is the number of channels and H and W are the height and width of each feature map;
step S200: inputting the feature maps of step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a fully connected layer, and computing the sample-level feature class loss;
step S300: accumulating the feature maps of step S100 along the channel dimension, retaining the largest connected component, and performing threshold binarization to remove background interference and retain the target key region, thereby obtaining target-level features and computing the target-level feature class loss;
step S400: setting a class label for each pixel of the feature maps of step S100 and computing and accumulating the class loss over all pixels, thereby locating the target key region more finely, obtaining pixel-level features, and computing the pixel-level feature class loss;
step S500: combining the sample-level, target-level and pixel-level feature class losses to jointly constrain the feature extraction network;
step S600: extracting media features with the feature extraction network, measuring the similarity between features of different media, and ranking them by similarity to realize retrieval (a similarity-ranking sketch follows this list).
Further, in the step S100, a ResNet-50 network is adopted to extract a group of 2048 × 14 × 14 feature maps, recorded as $S = \{S_1, S_2, \ldots, S_N\}$ with $S_i \in \mathbb{R}^{14 \times 14}$, where i = 1, 2, …, N and N = 2048.
Example 2:
in this embodiment, optimization is performed on the basis of embodiment 1, and the step S200 specifically includes the following steps:
The 2048-dimensional sample-level feature $a_k = \mathrm{GAP}(S) \in \mathbb{R}^{2048}$ is passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score $x_k$, which a Softmax function converts into the fine-grained class probability $p_k$; the sample-level feature class loss is then

$$L_{sam} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
in this embodiment, optimization is performed on the basis of embodiment 1 or 2, and the step S300 specifically includes the following steps:
first, the feature maps S are accumulated along the channel dimension to obtain the original activation map

$$A = \sum_{i=1}^{N} S_i$$

then only the largest connected component of the original activation map A is retained, giving the de-noised activation map $\tilde{A} = \mathrm{LCC}(A)$;

the de-noised activation map $\tilde{A}$ is then binarized against the response threshold a, set from its response mean $\bar{A}$, to obtain the target mask

$$M_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} > a \\ 0, & \text{otherwise} \end{cases}$$

wherein:

a is the response threshold,

finally, the feature maps S and the target mask M are multiplied element-wise and input into the global average pooling layer to obtain the target-level feature

$$a_k^{tar} = \mathrm{GAP}(S \odot M)$$

Substituting $a_k^{tar}$ into the fully connected layer, Softmax and cross-entropy pipeline of step S200 yields the target-level feature class loss

$$L_{tar} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

p is the fine-grained class probability,

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
Further, the step S400 specifically includes the following steps:
first, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label must be generated for each location (i, j) in the feature map S:

$$y_{i,j} = \begin{cases} c_k, & M_{i,j} = 1 \\ C + 1, & M_{i,j} = 0 \end{cases}$$

wherein:

k is the sample number and $c_k$ is the fine-grained class of sample k,

C is the total number of categories, and C + 1 denotes the auxiliary background class;

the numerical representation of each pixel class, from 1 to C + 1, is converted into a one-hot vector representation y:

$$y_m = \begin{cases} 1, & m = y_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

wherein m represents the m-th of the C + 1 classes;

the feature map S is input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, giving the (C+1) × H × W pixel-level features $\hat{S}$;

the class prediction score of each pixel of $\hat{S}$ is then converted into a class probability by a Softmax function:

$$p_{i,j} = \mathrm{Softmax}(\hat{S}_{:,i,j})$$

at this point, the fine-grained class loss of each pixel is

$$l_{i,j} = l(p_{i,j}, y_{i,j})$$

the fine-grained class losses of the target pixels and of the background pixels are accumulated separately:

$$L_{obj} = \sum_{(i,j):\, M_{i,j} = 1} l_{i,j}, \qquad L_{bg} = \sum_{(i,j):\, M_{i,j} = 0} l_{i,j}$$

the final pixel-level feature class loss $L_{pix}$ is obtained by linearly combining $L_{obj}$ and $L_{bg}$ according to the pixel proportions:

$$L_{pix} = \frac{n_{bg}}{HW} L_{obj} + \frac{n_{obj}}{HW} L_{bg}$$

wherein $n_{obj}$ and $n_{bg}$ are the numbers of target and background pixels in the mask M.
the rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
A fine-grained cross-media retrieval method based on multi-scale feature union expands the single sample-level feature scale into three scales, sample level, target level and pixel level, each constrained by a fine-grained class loss. First, the shortcomings of traditional sample-level features are analyzed, and key-region localization is chosen as the way to obtain more accurate fine-grained features. Then, target-level features are introduced: background interference is removed and the target key region retained by accumulating the feature maps, keeping the largest connected component and binarizing at a mean-based threshold. Next, pixel-level features are introduced: a class label is set for each pixel of the feature map, and the class losses of all pixels are computed and accumulated to locate the target key region more finely. Finally, the class loss functions of the three feature scales are combined to constrain the feature extraction network.
Further, as shown in FIG. 1, taking the image media type as an example, the steps and flow of the MSFG (Multi-Scale Fine-Grained) method are presented, where:
LCC (Largest Connected Component) denotes retaining the largest connected component,
AVG Pool denotes the global average pooling layer,
FC denotes a fully connected layer, Σ denotes accumulation,
Conv denotes a convolutional layer,
N denotes the number of channels of the feature maps, H and W are the height and width of each feature map, and C denotes the total number of categories.
As shown in FIG. 1, an image sample first passes through a conventional deep convolutional neural network, ResNet-50, to obtain a set of N × H × W feature maps. Three feature-map processing flows then produce the feature class losses at the three scales; Sample Loss, Target Loss and Pixel Loss in the figure correspond to the sample-level, target-level and pixel-level feature class losses, respectively. Sample-level features retain global information such as context, spatial position and the surrounding background, which contributes to feature extraction for every media type, especially text and audio samples. Meanwhile, for a fine-grained dataset this global information often carries distinctive semantic category cues, and such cues may differ morphologically within the same fine-grained category. The class loss on sample-level features is therefore irreplaceable for learning fine-grained features.
Further, the sample-level feature class loss is calculated as follows:
the 2048 × 14 × 14 feature maps extracted by ResNet-50 are recorded as $S = \{S_i\}$ (i = 1, 2, …, N); inputting S into the global average pooling layer yields the sample-level feature

$$a_k^I = \mathrm{GAP}(S) \in \mathbb{R}^{2048}$$

The 2048-dimensional sample-level feature $a_k^I$ is then passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score

$$x_k^I = W^{\top} a_k^I$$

$x_k^I$ is converted into the fine-grained class probability p by a Softmax function:

$$p = \mathrm{Softmax}(x_k^I)$$

Finally, the sample-level fine-grained class loss is constructed with the sample class label y:

$$L_{sam} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein I, T, V, A respectively represent the image, text, audio and video media types, and l(p, y) is the cross-entropy loss function:

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$
further, in order to reduce the computational efficiency and the model complexity, the feature-based graph S of the three-scale class loss is obtained, and the class probabilities are all constrained by using a cross-entropy loss function, so that the following target-level lossAnd pixel level lossThe only difference from the sample-level penalty is the computational process from the feature map S to X.
Further, the target-level feature class loss is calculated as follows:
In the field of fine-grained cross-media retrieval, locating the target region and its key parts is an effective way to extract fine-grained features. Inspired by the SCDA algorithm, the convolutional feature maps before the global pooling layer of a deep convolutional network carry, on each channel, activation responses to different local positions.
As shown in FIG. 2, which displays the feature maps of 4 randomly selected channels: in the first row, channels 108 and 468 activate the legs of the hummingbird, channel 375 activates its head, and channel 284 activates no region at all; in the second row, channels 468, 375 and 284 activate the hummingbird's head, feet and body respectively, while channel 108 activates background noise. Experiments verify that few channels activate background noise, and a large number of activation responses concentrate on the target position and its key regions. As shown in FIG. 3, accumulating the feature maps along the channel dimension and then retaining the largest connected component effectively locates the target position and eliminates background-noise interference; binarization then yields the target mask.
First, the feature maps $S_i$ (i = 1, 2, …, N) are accumulated along the channel dimension to obtain the original activation map A:

$$A = \sum_{i=1}^{N} S_i$$

Then only the largest connected component of the original activation map A is retained, giving the de-noised activation map $\tilde{A} = \mathrm{LCC}(A)$.

The de-noised activation map $\tilde{A}$ is then binarized against the response threshold a, set from its response mean $\bar{A}$, to obtain the target mask:

$$M_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} > a \\ 0, & \text{otherwise} \end{cases}$$

Finally, the feature maps S and the target mask M are multiplied element-wise and input into the global average pooling layer to obtain the target-level feature

$$a_k^{tar} = \mathrm{GAP}(S \odot M)$$

Substituting $a_k^{tar}$ into the classification pipeline above yields the target-level fine-grained class loss $L_{tar}$.
In contrast to sample-level features, target-level features focus only on the target region and its key parts. This automatically eliminates the influence of background noise and further amplifies the fine-grained characteristics of the target, making the prediction of the fine-grained class probability depend entirely on the key information provided by the target region, which effectively improves the accuracy of fine-grained class prediction.
Further, the pixel-level feature class loss is calculated as follows:
The pixel-level features set forth in this disclosure refer to pixels of the feature map, not pixels of the original 448 × 448 × 3 sample matrix. As noted above, locating the target position by channel-wise accumulation of the feature maps carries some error and ambiguity: the resulting target mask often does not fit the target edges tightly and retains some background. As shown in FIG. 4, the target localization in FIG. 4(b) is more accurate than in FIG. 4(a) and can focus on extracting more effective fine-grained features.
In order to realize more accurate target localization, pixel-level features are further introduced on the basis of the target-level features to confirm the target region at a finer scale. First, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label must be generated for each location (i, j) in the feature map:

$$y_{i,j} = \begin{cases} c_k, & M_{i,j} = 1 \\ C + 1, & M_{i,j} = 0 \end{cases}$$

The numerical representation of each pixel class, from 1 to C + 1, is converted into a one-hot vector representation y:

$$y_m = \begin{cases} 1, & m = y_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

The feature maps $S_i$ (i = 1, 2, …, N) are input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 (namely 201) output channels, giving the (C+1) × H × W pixel-level features $\hat{S}$.

The class prediction score of each pixel of $\hat{S}$ is then converted into a class probability by a Softmax function:

$$p_{i,j} = \mathrm{Softmax}(\hat{S}_{:,i,j})$$

At this point, the fine-grained class loss of each pixel is calculated as

$$l_{i,j} = l(p_{i,j}, y_{i,j})$$

The fine-grained class losses of the target pixels and of the background pixels are accumulated separately:

$$L_{obj} = \sum_{(i,j):\, M_{i,j} = 1} l_{i,j}, \qquad L_{bg} = \sum_{(i,j):\, M_{i,j} = 0} l_{i,j}$$

The final pixel-level fine-grained class loss $L_{pix}$ is obtained by linearly combining $L_{obj}$ and $L_{bg}$ according to the pixel proportions:

$$L_{pix} = \frac{n_{bg}}{HW} L_{obj} + \frac{n_{obj}}{HW} L_{bg}$$
In general, the difficulty of class prediction for a fine-grained sample is inversely proportional to the size of the target: the larger the target's share of the image, the higher the probability of an accurate class prediction. The target loss $L_{obj}$ and background loss $L_{bg}$ are therefore weighted so that the smaller the target, i.e. the higher the proportion of background pixels, the larger the weight given to the target loss, which biases the network parameters toward feature learning of the target region. Finally, the total loss function of the network is the sum of the class losses of the sample-level, target-level and pixel-level features:

$$L = L_{sam} + L_{tar} + L_{pix}$$
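Tying the three scales together, one joint training step might look like the following sketch; FeatureMapExtractor, SampleLevelHead, target_mask_and_feature and PixelLevelHead refer to the hypothetical sketches above, and reusing the SampleLevelHead shape for the target branch is an assumption, since the text only states that the three losses share the same cross-entropy form:

```python
import torch
import torch.nn.functional as F

# Hypothetical joint training step for L = L_sam + L_tar + L_pix (step S500).
backbone = FeatureMapExtractor()
sample_head = SampleLevelHead()
target_head = SampleLevelHead()   # same FC shape, fed target-level features instead
pixel_head = PixelLevelHead()
params = (list(backbone.parameters()) + list(sample_head.parameters())
          + list(target_head.parameters()) + list(pixel_head.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)

def train_step(imgs, labels):
    S = backbone(imgs)                            # step S100: feature maps
    _, loss_sam = sample_head(S, labels)          # step S200: sample-level loss
    M, a_tar = target_mask_and_feature(S)         # step S300: mask + target feature
    loss_tar = F.cross_entropy(target_head.fc(a_tar), labels)
    loss_pix = pixel_head(S, M, labels)           # step S400: pixel-level loss
    loss = loss_sam + loss_tar + loss_pix         # step S500: joint constraint
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(2, 3, 448, 448), torch.tensor([3, 17]))
```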
On the basis of conventional sample-level features, the invention additionally introduces target-level features and pixel-level features of key regions, and constructs from these three feature scales a training process in which three class loss functions jointly constrain the deep convolutional network. This effectively remedies the shortcomings of traditional common-feature extraction, where only a class loss on sample-level features is imposed, so that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an outsized influence on the extracted features. The invention introduces no extra parameters, adds almost no computation cost, extracts the common features of fine-grained data more accurately, and thereby improves the fine-grained cross-media retrieval effect.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.
Claims (7)
1. A fine-grained cross-media retrieval method based on multi-scale feature union is characterized by comprising the following steps:
step S100: acquiring a cross-media dataset containing image samples; processing each image sample with a deep convolutional neural network to obtain a group of N × H × W feature maps, where N is the number of channels and H and W are the height and width of each feature map;
step S200: inputting the feature maps of step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a fully connected layer, and computing the sample-level feature class loss;
step S300: accumulating the feature maps of step S100 along the channel dimension, retaining the largest connected component, and performing threshold binarization to remove background interference and retain the target key region, thereby obtaining target-level features and computing the target-level feature class loss;
step S400: setting a class label for each pixel of the feature maps of step S100 and computing and accumulating the class loss over all pixels, thereby locating the target key region more finely, obtaining pixel-level features, and computing the pixel-level feature class loss;
step S500: combining the sample-level, target-level and pixel-level feature class losses to jointly constrain the feature extraction network;
step S600: extracting media features with the feature extraction network, measuring the similarity between features of different media, and ranking them by similarity to realize retrieval.
3. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 2, wherein the step S200 specifically comprises the following steps:
The 2048-dimensional sample-level feature $a_k = \mathrm{GAP}(S) \in \mathbb{R}^{2048}$ is passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score $x_k$, which a Softmax function converts into the fine-grained class probability $p_k$; the sample-level feature class loss is then

$$L_{sam} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
4. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 2, wherein the step S300 specifically comprises the following steps:
first, the feature maps S are accumulated along the channel dimension to obtain the original activation map

$$A = \sum_{i=1}^{N} S_i$$

then only the largest connected component of the original activation map A is retained, giving the de-noised activation map $\tilde{A} = \mathrm{LCC}(A)$;

the de-noised activation map $\tilde{A}$ is then binarized against the response threshold a, set from its response mean $\bar{A}$, to obtain the target mask

$$M_{i,j} = \begin{cases} 1, & \tilde{A}_{i,j} > a \\ 0, & \text{otherwise} \end{cases}$$

wherein:

a is the response threshold,

finally, the feature maps S and the target mask M are multiplied element-wise and input into the global average pooling layer to obtain the target-level feature

$$a_k^{tar} = \mathrm{GAP}(S \odot M)$$

Substituting $a_k^{tar}$ into the fully connected layer, Softmax and cross-entropy pipeline of step S200 yields the target-level feature class loss

$$L_{tar} = \sum_{m \in \{I, T, V, A\}} \sum_{k} l(p_k^m, y_k^m)$$

wherein:

p is the fine-grained class probability,

I, T, V, A respectively represent the image, text, audio and video media types,

k is the sample number,

l(p, y) is the cross-entropy loss function

$$l(p, y) = -\sum_{j=1}^{C} y_j \log p_j$$

wherein C is the total number of categories.
5. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 4, wherein the step S400 specifically comprises the following steps:
first, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label must be generated for each location (i, j) in the feature map S:

$$y_{i,j} = \begin{cases} c_k, & M_{i,j} = 1 \\ C + 1, & M_{i,j} = 0 \end{cases}$$

wherein:

k is the sample number and $c_k$ is the fine-grained class of sample k,

C is the total number of categories, and C + 1 denotes the auxiliary background class;

the numerical representation of each pixel class, from 1 to C + 1, is converted into a one-hot vector representation y:

$$y_m = \begin{cases} 1, & m = y_{i,j} \\ 0, & \text{otherwise} \end{cases}$$

wherein m represents the m-th of the C + 1 classes;

the feature map S is input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, giving the (C+1) × H × W pixel-level features $\hat{S}$;

the class prediction score of each pixel of $\hat{S}$ is then converted into a class probability by a Softmax function:

$$p_{i,j} = \mathrm{Softmax}(\hat{S}_{:,i,j})$$

at this point, the fine-grained class loss of each pixel is

$$l_{i,j} = l(p_{i,j}, y_{i,j})$$

the fine-grained class losses of the target pixels and of the background pixels are accumulated separately:

$$L_{obj} = \sum_{(i,j):\, M_{i,j} = 1} l_{i,j}, \qquad L_{bg} = \sum_{(i,j):\, M_{i,j} = 0} l_{i,j}$$

the final pixel-level feature class loss $L_{pix}$ is obtained by linearly combining $L_{obj}$ and $L_{bg}$ according to the pixel proportions:

$$L_{pix} = \frac{n_{bg}}{HW} L_{obj} + \frac{n_{obj}}{HW} L_{bg}$$
6. The fine-grained cross-media retrieval method based on multi-scale feature union of claim 1, wherein the loss function of the feature extraction network in step S500 is the sum of the sample-level feature class loss, the target-level feature class loss and the pixel-level feature class loss:

$$L = L_{sam} + L_{tar} + L_{pix}$$
7. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 6, wherein the sample-level feature class loss, the target-level feature class loss and the pixel-level feature class loss are all obtained based on a feature map, and all use a cross entropy loss function to constrain class probability.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111258804.8A | 2021-10-28 | 2021-10-28 | Fine-grained cross-media retrieval method based on multi-scale feature union
Publications (2)

Publication Number | Publication Date
---|---
CN113704537A | 2021-11-26
CN113704537B | 2022-02-15
Family
ID=78647132

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202111258804.8A (granted as CN113704537B, Active) | Fine-grained cross-media retrieval method based on multi-scale feature union | 2021-10-28 | 2021-10-28

Country Status (1)

Country | Link
---|---
CN | CN113704537B
Also Published As

Publication Number | Publication Date
---|---
CN113704537B | 2022-02-15
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |