CN113704537A - Fine-grained cross-media retrieval method based on multi-scale feature union - Google Patents

Fine-grained cross-media retrieval method based on multi-scale feature union

Info

Publication number
CN113704537A
CN113704537A (application CN202111258804.8A)
Authority
CN
China
Prior art keywords
sample
class
fine
grained
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111258804.8A
Other languages
Chinese (zh)
Other versions
CN113704537B (en)
Inventor
姚亚洲
孙泽人
陈涛
张传一
沈复民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Code Geek Technology Co ltd
Original Assignee
Nanjing Code Geek Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Code Geek Technology Co ltd filed Critical Nanjing Code Geek Technology Co ltd
Priority to CN202111258804.8A priority Critical patent/CN113704537B/en
Publication of CN113704537A publication Critical patent/CN113704537A/en
Application granted granted Critical
Publication of CN113704537B publication Critical patent/CN113704537B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/583: Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/5866: Information retrieval of still image data; retrieval characterised by using manually generated information, e.g. tags, keywords, comments, location and time information
    • G06F16/783: Information retrieval of video data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/7867: Information retrieval of video data; retrieval characterised by using manually generated information, e.g. tags, keywords, comments, title and artist information, user ratings
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a fine-grained cross-media retrieval method based on the union of multi-scale features. The method addresses the problems that conventional common-feature extraction is constrained only by a sample-level class loss, and that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an excessive influence on the extracted features. The invention introduces no extra parameters and adds almost no computational cost, extracts the common features of fine-grained data more accurately, and thereby further improves fine-grained cross-media retrieval.

Description

Fine-grained cross-media retrieval method based on multi-scale feature union
Technical Field
The invention belongs to the technical field of information retrieval, and particularly relates to a fine-grained cross-media retrieval method based on multi-scale feature union.
Background
To deliver high-quality, fine-level information retrieval, fine-grained cross-media retrieval has become a research hotspot in the big-data era. Compared with conventional cross-media retrieval, it relies on accurate common-feature extraction and can provide users with more precise and efficient multimedia retrieval services. Because fine-grained datasets exhibit small inter-class differences but large intra-class differences, directly applying a conventional deep convolutional network to extract features from fine-grained samples usually yields unsatisfactory results.
The key to fine-grained feature extraction is locating and identifying local key regions, so that detailed target features can be extracted accurately and a better cross-media retrieval result obtained. During common-feature extraction for cross-media retrieval, the decisive information usually lies in a small key region of the fine-grained data, such as the head, wings or tail of a bird, while the remaining large areas are only background noise or non-key parts of the object.
Conventional fine-grained recognition methods locate the key region through complex computation such as an attention mechanism, crop the region out of the original data, and then feed it into a deep network to extract fine-grained features. Such methods have high model complexity and computational cost, and when the key region is located inaccurately the feature extraction result suffers severely.
Disclosure of Invention
The invention aims to provide a fine-grained cross-media retrieval method based on the union of multi-scale features. On the basis of conventional sample-level features, it additionally introduces target-level features and pixel-level features of key regions, and constructs three class loss functions from these three feature scales to jointly constrain the deep convolutional network. This effectively addresses the problems that conventional common-feature extraction is constrained only by a sample-level class loss, and that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an excessive influence.
The invention is mainly realized by the following technical scheme:
a fine-grained cross-media retrieval method based on multi-scale feature union comprises the following steps:
step S100: acquiring a cross-media dataset containing image samples, and processing each image sample with a deep convolutional neural network to obtain a group of N × H × W feature maps, wherein N is the number of channels of the feature maps and H and W are the height and width of each feature map respectively;
step S200: inputting the feature maps of step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a fully connected layer, and calculating the sample-level feature class loss;
step S300: accumulating the feature maps of step S100 along the channel dimension, retaining the largest connected component and performing threshold binarization to remove background interference and keep the target key region, thereby obtaining target-level features and calculating the target-level feature class loss;
step S400: setting a class label for each pixel of the feature maps of step S100, and calculating and accumulating the class losses of all pixels to locate the target key region more finely, thereby obtaining pixel-level features and calculating the pixel-level feature class loss;
step S500: combining the sample-level, target-level and pixel-level feature class losses to jointly constrain the feature extraction network;
step S600: extracting media features with the feature extraction network, measuring the similarity between features of different media, and ranking them by similarity to realize retrieval (a retrieval-stage sketch follows this list).
In order to better implement the present invention, in step S100 a ResNet-50 network is used to extract a group of 2048 × 14 × 14 feature maps, recorded as S = {S_i}, i = 1, 2, ..., N.
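A minimal sketch of step S100 under the assumption that the ResNet-50 backbone comes from torchvision (the patent only names ResNet-50): dropping the pooling and classification head and feeding a 448 × 448 input yields the 2048 × 14 × 14 feature maps S.

import torch
import torch.nn as nn
from torchvision import models

# Backbone without the global pooling and fully connected layers, so the output
# is the group of convolutional feature maps S (torchvision >= 0.13 assumed).
resnet = models.resnet50(weights="IMAGENET1K_V1")
backbone = nn.Sequential(*list(resnet.children())[:-2])

x = torch.randn(1, 3, 448, 448)   # one 448 x 448 image sample
S = backbone(x)                   # shape (1, 2048, 14, 14): N = 2048, H = W = 14
print(S.shape)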
In order to better implement the present invention, further, the step S200 specifically includes the following steps:
The feature maps S are input into the global average pooling layer to obtain the 2048-dimensional sample-level feature X, whose n-th component is the spatial mean of the n-th feature map:
X_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} S_n(i, j), n = 1, 2, ..., N.
The 2048-dimensional sample-level feature X is then passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score Z, and Z is converted into the fine-grained class probability p by the Softmax function:
p_c = exp(Z_c) / Σ_{j=1}^{C} exp(Z_j), c = 1, 2, ..., C.
Finally, the sample-level feature class loss L_sample is constructed with the sample class label y:
L_sample = Σ_{k=1}^{n_b} [ l(p_k^I, y_k^I) + l(p_k^T, y_k^T) + l(p_k^A, y_k^A) + l(p_k^V, y_k^V) ]
wherein:
n_b denotes the number of samples in a batch,
I, T, A and V denote the image, text, audio and video media types respectively,
p_k^I and y_k^I are the fine-grained class probability and the class label of the k-th image sample,
p_k^T and y_k^T are the fine-grained class probability and the class label of the k-th text sample,
p_k^A and y_k^A are the fine-grained class probability and the class label of the k-th audio sample,
p_k^V and y_k^V are the fine-grained class probability and the class label of the k-th video sample,
k is the sample index,
and l(p, y) is the cross-entropy loss function:
l(p, y) = − Σ_{c=1}^{C} y_c log(p_c)
wherein C is the total number of categories.
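The sample-level branch of step S200 can be sketched as follows (PyTorch assumed; SampleLevelHead is an illustrative name). F.cross_entropy applies Softmax internally, so for integer labels it computes the same l(p, y) as above; in the full method the loss would be summed over the image, text, audio and video branches.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleLevelHead(nn.Module):
    def __init__(self, channels=2048, num_classes=200):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)   # 2048 x 200 fully connected layer

    def forward(self, S):
        # S: (B, N, H, W) feature maps from the backbone
        X = S.mean(dim=(2, 3))        # global average pooling -> sample-level feature (B, N)
        Z = self.fc(X)                # 200-dimensional class scores
        return X, Z

head = SampleLevelHead()
S = torch.randn(4, 2048, 14, 14)     # a batch of feature maps
y = torch.randint(0, 200, (4,))      # fine-grained class labels
X, Z = head(S)
L_sample = F.cross_entropy(Z, y)     # cross-entropy over Softmax(Z) for one media branch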
In order to better implement the present invention, further, the step S300 specifically includes the following steps:
Firstly, the feature maps S are accumulated along the channel dimension to obtain the original activation map A:
A(i, j) = Σ_{n=1}^{N} S_n(i, j).
Then only the largest connected component of the original activation map A is retained, which yields the de-noised activation map Ã.
The de-noised activation map Ã is then binarized with a threshold based on its response mean to obtain the target mask M:
M(i, j) = 1 if Ã(i, j) > a · mean(Ã), and M(i, j) = 0 otherwise,
wherein:
a is the response threshold,
Ã(i, j) is the value of the de-noised activation map at position (i, j),
M(i, j) is the target mask at position (i, j).
Finally, the feature maps S are multiplied position by position with the target mask M and the result is input into the global average pooling layer to obtain the target-level feature X':
X'_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} S_n(i, j) · M(i, j), n = 1, 2, ..., N.
Substituting X' into the classification branch (fully connected layer and Softmax) in place of X yields the target-level feature class loss L_target:
L_target = Σ_{k=1}^{n_b} [ l(p_k^I, y_k^I) + l(p_k^T, y_k^T) + l(p_k^A, y_k^A) + l(p_k^V, y_k^V) ]
wherein:
Z is the class score of the target-level feature,
p is the fine-grained class probability computed from the target-level feature,
n_b denotes the number of samples in a batch,
I, T, A and V denote the image, text, audio and video media types respectively,
p_k^I and y_k^I, p_k^T and y_k^T, p_k^A and y_k^A, p_k^V and y_k^V are the fine-grained class probability and the class label of the k-th image, text, audio and video sample respectively,
k is the sample index,
and l(p, y) is the cross-entropy loss function:
l(p, y) = − Σ_{c=1}^{C} y_c log(p_c)
wherein C is the total number of categories.
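A sketch of step S300 under stated assumptions: SciPy is used for the connected-component step, and the binarization is applied before selecting the largest connected component so that components are well defined, which is one plausible reading of the order described above. target_level_feature and the parameter a are illustrative names.

import numpy as np
import torch
from scipy import ndimage

def target_level_feature(S, a=1.0):
    """S: (N, H, W) feature maps of one sample; a: scale of the response-mean threshold."""
    A = S.sum(dim=0)                                    # original activation map, shape (H, W)
    binary = (A > a * A.mean()).cpu().numpy()           # binarize at the scaled response mean
    labels, num = ndimage.label(binary)                 # connected components of the binary map
    if num > 0:
        sizes = ndimage.sum(binary.astype(float), labels, index=range(1, num + 1))
        largest = int(np.argmax(sizes)) + 1
        mask = torch.from_numpy(labels == largest).to(S)  # keep only the largest component
    else:
        mask = torch.ones_like(A)                        # no response above threshold: keep all
    X_t = (S * mask).mean(dim=(1, 2))                    # masked global average pooling, (N,)
    return X_t, mask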
In order to better implement the present invention, further, the step S400 specifically includes the following steps:
Firstly, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label ỹ(i, j) must be generated for each position (i, j) of the feature map S:
ỹ(i, j) = y_k for target pixels (M(i, j) = 1), and ỹ(i, j) = C + 1 (the background class) for background pixels (M(i, j) = 0),
wherein:
k is the sample index and y_k is the class label of the k-th sample,
C is the total number of categories.
The numerical class representation of each pixel is then converted into a (C + 1)-dimensional one-hot vector y, whose m-th component is
y_m = 1 if m equals the pixel's class, and y_m = 0 otherwise,
wherein m indexes the C + 1 classes.
The feature maps S are input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, which yields the (C + 1) × H × W pixel-level features P.
The class prediction score of each pixel of P is then converted into a class probability by the Softmax function:
p(i, j)_m = exp(P_m(i, j)) / Σ_{m'=1}^{C+1} exp(P_{m'}(i, j)).
At this point, the fine-grained class loss l(i, j) of each pixel is calculated as:
l(i, j) = − Σ_{m=1}^{C+1} y(i, j)_m log(p(i, j)_m).
The fine-grained class losses of the target pixels and of the background pixels are accumulated separately into L_tgt and L_bg:
L_tgt = Σ_{(i, j): M(i, j) = 1} l(i, j),  L_bg = Σ_{(i, j): M(i, j) = 0} l(i, j),
wherein:
n_tgt is the number of target pixels,
n_bg is the number of background pixels.
The final pixel-level feature class loss L_pixel is obtained by linearly combining L_tgt and L_bg according to the pixel proportions:
L_pixel = (n_bg / (n_tgt + n_bg)) · L_tgt + (n_tgt / (n_tgt + n_bg)) · L_bg.
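A sketch of step S400 (PyTorch assumed). The background class is mapped to index C in 0-based indexing, and the proportional weighting follows the linear combination given above; the helper name pixel_level_loss and the unnormalised sums for L_tgt and L_bg are illustrative choices rather than the patent's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

C = 200                                              # total number of fine-grained classes
pixel_head = nn.Conv2d(2048, C + 1, kernel_size=1)   # N input channels, C + 1 output channels

def pixel_level_loss(S, mask, y):
    """S: (N, H, W) feature maps; mask: (H, W) target mask in {0, 1}; y: int class index in [0, C)."""
    P = pixel_head(S.unsqueeze(0))                    # (1, C + 1, H, W) pixel-level class scores
    tgt = mask.bool()
    # Pixel-level auxiliary labels: the sample's class on target pixels, background index C elsewhere
    labels = torch.where(tgt, torch.full_like(mask, y), torch.full_like(mask, C)).long()
    loss_map = F.cross_entropy(P, labels.unsqueeze(0), reduction="none").squeeze(0)   # (H, W)
    n_tgt = tgt.sum().float()
    n_bg = (~tgt).sum().float()
    L_tgt = loss_map[tgt].sum()                       # accumulated loss of the target pixels
    L_bg = loss_map[~tgt].sum()                       # accumulated loss of the background pixels
    # Linear combination by pixel proportion: a smaller target gets a larger target-loss weight
    return (n_bg * L_tgt + n_tgt * L_bg) / (n_tgt + n_bg)

S = torch.randn(2048, 14, 14)
mask = (torch.rand(14, 14) > 0.7).float()
loss = pixel_level_loss(S, mask, y=42)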
in order to better implement the present invention, further, the loss function of the feature extraction network in step S500 is the sum of the sample-level feature class loss, the target-level feature class loss, and the pixel-level feature class loss:
Figure 973674DEST_PATH_IMAGE046
wherein:
Figure 170300DEST_PATH_IMAGE007
for the sample-level feature class loss,
Figure 924629DEST_PATH_IMAGE029
for the purpose of the target-level feature class loss,
Figure 661641DEST_PATH_IMAGE044
is a pixel level feature class penalty.
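A toy illustration of the joint constraint of step S500, with placeholder losses standing in for the three class losses: because all three terms depend on the same shared parameters, summing them and back-propagating once updates the network under all three feature scales simultaneously.

import torch
import torch.nn as nn

shared = nn.Linear(8, 8)              # stands in for the shared feature extraction network
x = torch.randn(2, 8)
feat = shared(x)

L_sample = feat.mean().abs()          # placeholders for the three class losses
L_target = (feat ** 2).mean()
L_pixel = feat.std()

L = L_sample + L_target + L_pixel     # L = L_sample + L_target + L_pixel
L.backward()
print(shared.weight.grad.shape)       # gradients from all three losses accumulate here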
In order to better implement the present invention, further, the sample-level feature class loss, the target-level feature class loss, and the pixel-level feature class loss are all obtained based on a feature map, and all use a cross entropy loss function to constrain class probabilities.
The invention has the beneficial effects that:
on the basis of the traditional sample-level features, the invention additionally introduces target-level features and pixel-level features of key regions, and constructs a training process of three class loss functions jointly constraining the deep convolutional network based on the three scale features. The method effectively solves the problems that only the category loss constraint of the sample-level features exists in the traditional public feature extraction method, the background noise and non-critical areas in the sample can possibly cause misleading to fine-grained category prediction, and the background noise and non-critical area features in the sample have larger influence. According to the invention, extra parameters are not required to be introduced, the calculation cost is hardly increased, the common features of the fine-grained data are more accurately extracted, and the fine-grained cross-media retrieval effect is further improved.
Drawings
FIG. 1 is a functional block diagram of the present invention;
FIG. 2 is a schematic diagram of Feature Maps activation regions;
FIG. 3 is a schematic view of a target location process;
FIG. 4 is a comparison graph of target localization accuracy.
Detailed Description
Example 1:
a fine-grained cross-media retrieval method based on multi-scale feature union is shown in FIG. 1, and comprises the following steps:
step S100: acquiring a cross-media data set containing image samples; processing the image sample by a deep convolutional neural network to obtain a group of N multiplied by H multiplied by W characteristic images, wherein N is the channel number of the characteristic images, and H and W are the length and the width of each characteristic image respectively;
step S200: inputting the feature map in the step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a full-connection layer, and calculating to obtain sample-level feature category loss;
step S300: sequentially accumulating the characteristic diagram in the step S100, reserving a maximum connected component, performing threshold value binarization processing to remove background interference, reserving a target key area to obtain a target level characteristic, and calculating to obtain a target level characteristic category loss;
step S400: setting a category label for each pixel of the feature map in the step S100, calculating and accumulating category loss functions of all pixels, realizing finer positioning of a target key area, obtaining pixel-level features, and calculating to obtain pixel-level feature category loss;
step S500: combining sample-level feature, target-level feature and pixel-level feature to jointly constrain a feature extraction network;
step S600: media features are extracted through a feature extraction network, the similarity among different media features is measured, and the media features are sequenced according to the similarity, so that retrieval is realized.
Further, in step S100 a ResNet-50 network is used to extract a group of 2048 × 14 × 14 feature maps, recorded as S = {S_i}, i = 1, 2, ..., N.
Example 2:
In this embodiment, optimization is performed on the basis of embodiment 1, and the step S200 specifically includes the following steps:
The feature maps S are input into the global average pooling layer to obtain the 2048-dimensional sample-level feature X, whose n-th component is the spatial mean of the n-th feature map:
X_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} S_n(i, j), n = 1, 2, ..., N.
The 2048-dimensional sample-level feature X is then passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score Z, and Z is converted into the fine-grained class probability p by the Softmax function:
p_c = exp(Z_c) / Σ_{j=1}^{C} exp(Z_j), c = 1, 2, ..., C.
Finally, the sample-level feature class loss L_sample is constructed with the sample class label y:
L_sample = Σ_{k=1}^{n_b} [ l(p_k^I, y_k^I) + l(p_k^T, y_k^T) + l(p_k^A, y_k^A) + l(p_k^V, y_k^V) ]
wherein:
n_b denotes the number of samples in a batch,
I, T, A and V denote the image, text, audio and video media types respectively,
p_k^I and y_k^I are the fine-grained class probability and the class label of the k-th image sample,
p_k^T and y_k^T are the fine-grained class probability and the class label of the k-th text sample,
p_k^A and y_k^A are the fine-grained class probability and the class label of the k-th audio sample,
p_k^V and y_k^V are the fine-grained class probability and the class label of the k-th video sample,
k is the sample index,
and l(p, y) is the cross-entropy loss function:
l(p, y) = − Σ_{c=1}^{C} y_c log(p_c)
wherein C is the total number of categories.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
In this embodiment, optimization is performed on the basis of embodiment 1 or 2, and the step S300 specifically includes the following steps:
Firstly, the feature maps S are accumulated along the channel dimension to obtain the original activation map A:
A(i, j) = Σ_{n=1}^{N} S_n(i, j).
Then only the largest connected component of the original activation map A is retained, which yields the de-noised activation map Ã.
The de-noised activation map Ã is then binarized with a threshold based on its response mean to obtain the target mask M:
M(i, j) = 1 if Ã(i, j) > a · mean(Ã), and M(i, j) = 0 otherwise,
wherein:
a is the response threshold,
Ã(i, j) is the value of the de-noised activation map at position (i, j),
M(i, j) is the target mask at position (i, j).
Finally, the feature maps S are multiplied position by position with the target mask M and the result is input into the global average pooling layer to obtain the target-level feature X':
X'_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} S_n(i, j) · M(i, j), n = 1, 2, ..., N.
Substituting X' into the classification branch (fully connected layer and Softmax) in place of X yields the target-level feature class loss L_target:
L_target = Σ_{k=1}^{n_b} [ l(p_k^I, y_k^I) + l(p_k^T, y_k^T) + l(p_k^A, y_k^A) + l(p_k^V, y_k^V) ]
wherein:
Z is the class score of the target-level feature,
p is the fine-grained class probability computed from the target-level feature,
n_b denotes the number of samples in a batch,
I, T, A and V denote the image, text, audio and video media types respectively,
p_k^I and y_k^I, p_k^T and y_k^T, p_k^A and y_k^A, p_k^V and y_k^V are the fine-grained class probability and the class label of the k-th image, text, audio and video sample respectively,
k is the sample index,
and l(p, y) is the cross-entropy loss function:
l(p, y) = − Σ_{c=1}^{C} y_c log(p_c)
wherein C is the total number of categories.
Further, the step S400 specifically includes the following steps:
Firstly, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label ỹ(i, j) must be generated for each position (i, j) of the feature map S:
ỹ(i, j) = y_k for target pixels (M(i, j) = 1), and ỹ(i, j) = C + 1 (the background class) for background pixels (M(i, j) = 0),
wherein:
k is the sample index and y_k is the class label of the k-th sample,
C is the total number of categories.
The numerical class representation of each pixel is then converted into a (C + 1)-dimensional one-hot vector y, whose m-th component is
y_m = 1 if m equals the pixel's class, and y_m = 0 otherwise,
wherein m indexes the C + 1 classes.
The feature maps S are input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, which yields the (C + 1) × H × W pixel-level features P.
The class prediction score of each pixel of P is then converted into a class probability by the Softmax function:
p(i, j)_m = exp(P_m(i, j)) / Σ_{m'=1}^{C+1} exp(P_{m'}(i, j)).
At this point, the fine-grained class loss l(i, j) of each pixel is calculated as:
l(i, j) = − Σ_{m=1}^{C+1} y(i, j)_m log(p(i, j)_m).
The fine-grained class losses of the target pixels and of the background pixels are accumulated separately into L_tgt and L_bg:
L_tgt = Σ_{(i, j): M(i, j) = 1} l(i, j),  L_bg = Σ_{(i, j): M(i, j) = 0} l(i, j),
wherein:
n_tgt is the number of target pixels,
n_bg is the number of background pixels.
The final pixel-level feature class loss L_pixel is obtained by linearly combining L_tgt and L_bg according to the pixel proportions:
L_pixel = (n_bg / (n_tgt + n_bg)) · L_tgt + (n_tgt / (n_tgt + n_bg)) · L_bg.
the rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
a fine-grained cross-media retrieval method based on multi-scale feature combination is characterized in that a sample-level feature scale is expanded into three scales of a sample level, a target level and a pixel level, and fine-grained category information loss constraint is carried out. Firstly, the defects of the traditional sample-level features are analyzed, and the idea of key area positioning is selected to obtain more accurate fine-grained features. And then, introducing target level features, removing background interference by a method of accumulating a feature map, carrying out mean binarization and reserving a maximum connected component, and reserving a target key area. And then, introducing pixel-level features, setting a class label for each pixel of the feature map, and calculating and accumulating class loss functions of all pixels to more finely position a target key area. And finally combining the class loss function constraint characteristics of the three characteristic scales to extract the network.
Further, as shown in FIG. 1, the steps and flow of the MSFG (Multi-Scale Fine-Grained) method are presented, taking the image media type as an example.
In the figure, LCC (Largest Connected Component) denotes retaining the largest connected component,
AVG POOL denotes the global average pooling layer,
FC denotes a fully connected layer, Σ denotes accumulation,
CONV denotes a convolutional layer,
N denotes the number of channels of the feature maps, H and W are the height and width of each feature map respectively, and C denotes the total number of categories.
As shown in FIG. 1, an image sample is first passed through a conventional deep convolutional neural network, ResNet-50, to obtain a set of N × H × W feature maps. The feature maps then go through three processing flows that yield the feature class losses of the three scales; Sample Loss, Target Loss and Pixel Loss in the figure correspond to the sample-level, target-level and pixel-level feature class losses respectively. Sample-level features retain global information such as context, spatial position and the surrounding background, and contribute to feature extraction for every media type, especially text and audio samples. Meanwhile, for a fine-grained dataset this global information often carries some unique semantic class information, and that information may differ morphologically within the same fine-grained category. The class loss of the sample-level features is therefore irreplaceable for learning fine-grained features.
Further, the sample-level feature class loss is calculated as follows:
The 2048 × 14 × 14 feature maps extracted by ResNet-50 are recorded as S = {S_i}, i = 1, 2, ..., N. S is input into the global average pooling layer to obtain the sample-level feature X:
X_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} S_n(i, j), n = 1, 2, ..., N.
The 2048-dimensional sample-level feature X is then passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score Z, and Z is converted into the fine-grained class probability p by the Softmax function:
p_c = exp(Z_c) / Σ_{j=1}^{C} exp(Z_j), c = 1, 2, ..., C.
Finally, the sample-level fine-grained class loss is constructed with the sample class label y:
L_sample = Σ_{k=1}^{n_b} [ l(p_k^I, y_k^I) + l(p_k^T, y_k^T) + l(p_k^A, y_k^A) + l(p_k^V, y_k^V) ]
wherein I, T, A and V denote the image, text, audio and video media types respectively, p_k^I and y_k^I, p_k^T and y_k^T, p_k^A and y_k^A, p_k^V and y_k^V are the fine-grained class probability and the class label of the k-th image, text, audio and video sample respectively, n_b denotes the number of samples in a batch, and l(p, y) is the cross-entropy loss function:
l(p, y) = − Σ_{c=1}^{C} y_c log(p_c).
further, in order to reduce the computational efficiency and the model complexity, the feature-based graph S of the three-scale class loss is obtained, and the class probabilities are all constrained by using a cross-entropy loss function, so that the following target-level loss
Figure DEST_PATH_IMAGE055
And pixel level loss
Figure 74319DEST_PATH_IMAGE056
The only difference from the sample-level penalty is the computational process from the feature map S to X.
Further, the target-level feature class loss is calculated as follows:
in the field of fine-grained cross-media retrieval, locating a target area and a key part is an effective method for extracting fine-grained features. Inspired by the SCDA algorithm, the convolution Feature map Maps in front of the global pooling layer in the deep convolutional network have activation responses aiming at different local positions on each channel.
As shown in fig. 2, the graph randomly selects a Feature map of 4 channels from Feature Maps for display, and it can be seen from the figure that: in the first row, channels 108 and 468 activate the legs of the hummingbirds, channel 375 activates the heads of the hummingbirds, and channel 284 does not even activate any region; in the second row, the 468 th, 375 th and 284 th channels activate the hummingbird head, feet and body, respectively, while the 108 th channel activates background noise. Experiments verify that the channels for activating background noise are few, and a large number of activation responses are concentrated in a target position and a key area thereof. As shown in fig. 3, Feature Maps are accumulated along the channel dimension, and then the maximum connected component is retained, so that the target position can be effectively positioned, background noise interference is eliminated, and binarization is performed to obtain a target mask.
Firstly, the feature maps S = {S_i}, i = 1, 2, ..., N, are accumulated along the channel dimension to yield the original activation map A:
A(i, j) = Σ_{n=1}^{N} S_n(i, j).
Then only the largest connected component of the original activation map A is retained, which yields the de-noised activation map Ã. The de-noised activation map Ã is then binarized with a threshold based on its response mean to obtain the target mask M:
M(i, j) = 1 if Ã(i, j) > a · mean(Ã), and M(i, j) = 0 otherwise.
Finally, the feature maps S are multiplied position by position with the target mask M and the result is input into the global average pooling layer to obtain the target-level feature X':
X'_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} S_n(i, j) · M(i, j).
Substituting X' into the classification branch in place of X yields the target-level fine-grained class loss L_target:
L_target = Σ_{k=1}^{n_b} [ l(p_k^I, y_k^I) + l(p_k^T, y_k^T) + l(p_k^A, y_k^A) + l(p_k^V, y_k^V) ].
In contrast to sample-level features, target-level features focus only on the target region and its key parts. They automatically remove the influence of background noise, further amplify the fine-grained characteristics of the target, make the prediction of the fine-grained class probability depend entirely on the key information provided by the target region, and effectively improve fine-grained class prediction accuracy.
Further, the pixel-level feature class loss is calculated as follows:
the pixel-level features set forth in this disclosure refer to pixels on the feature map, rather than pixels in the 448 x 3 matrix of the original sample. In the foregoing, there is a certain error and ambiguity in the method for locating the target position by using the feature map in the way of channel dimension accumulation, that is, the located target Mask is often not tightly attached to the target edge, and there is a part of background residue. As shown in fig. 4, the target location of fig. 4 (b) is more accurate and can focus on extracting more effective fine-grained features than fig. 4 (a).
In order to realize more accurate target localization, pixel-level features are further introduced on the basis of the target-level features to confirm the target region at a finer scale. First, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label ỹ(i, j) must be generated for each position (i, j) of the feature map:
ỹ(i, j) = y_k for target pixels (M(i, j) = 1), and ỹ(i, j) = C + 1 (the background class) for background pixels (M(i, j) = 0).
The numerical class representation of each pixel is converted into a (C + 1)-dimensional one-hot vector y:
y_m = 1 if m equals the pixel's class, and y_m = 0 otherwise, m = 1, ..., C + 1.
The feature maps S = {S_i}, i = 1, 2, ..., N, are input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 (namely 201) output channels, which yields the (C + 1) × H × W pixel-level features P. The class prediction score of each pixel of P is then converted into a class probability by the Softmax function:
p(i, j)_m = exp(P_m(i, j)) / Σ_{m'=1}^{C+1} exp(P_{m'}(i, j)).
At this point, the fine-grained class loss of each pixel is calculated as:
l(i, j) = − Σ_{m=1}^{C+1} y(i, j)_m log(p(i, j)_m).
The fine-grained class losses of the target pixels and of the background pixels are accumulated separately into L_tgt and L_bg:
L_tgt = Σ_{(i, j): M(i, j) = 1} l(i, j),  L_bg = Σ_{(i, j): M(i, j) = 0} l(i, j).
The final pixel-level fine-grained class loss L_pixel is obtained by linearly combining L_tgt and L_bg according to the pixel proportions:
L_pixel = (n_bg / (n_tgt + n_bg)) · L_tgt + (n_tgt / (n_tgt + n_bg)) · L_bg,
wherein n_tgt is the number of target pixels and n_bg is the number of background pixels.
In general, the difficulty of class prediction for a fine-grained sample depends on the size of the target: the larger the target's share of the image, the higher the probability of an accurate class prediction. The target loss L_tgt and the background loss L_bg are therefore weighted by the opposite pixel proportions: the smaller the target, the larger the share of background pixels in the total and the larger the weight given to the target loss, which biases the network parameters towards learning features of the target region. Finally, the total loss function of the network is the sum of the class losses of the sample-level, target-level and pixel-level features:
L = L_sample + L_target + L_pixel.
On the basis of conventional sample-level features, the invention additionally introduces target-level features and pixel-level features of key regions, and constructs a training process in which three class loss functions built on the three feature scales jointly constrain the deep convolutional network. This effectively addresses the problems that conventional common-feature extraction is constrained only by a sample-level class loss, and that background noise and non-key regions in a sample can mislead fine-grained class prediction and exert an excessive influence. The invention introduces no extra parameters and adds almost no computational cost, extracts the common features of fine-grained data more accurately, and thereby further improves fine-grained cross-media retrieval.
The above description is only a preferred embodiment of the present invention and does not limit the present invention in any way; any simple modification or equivalent variation of the above embodiments made according to the technical spirit of the present invention falls within the protection scope of the present invention.

Claims (7)

1. A fine-grained cross-media retrieval method based on multi-scale feature union, characterized by comprising the following steps:
step S100: acquiring a cross-media dataset containing image samples, and processing each image sample with a deep convolutional neural network to obtain a group of N × H × W feature maps, wherein N is the number of channels of the feature maps and H and W are the height and width of each feature map respectively;
step S200: inputting the feature maps of step S100 into a global average pooling layer to obtain sample-level features, processing the sample-level features through a fully connected layer, and calculating the sample-level feature class loss;
step S300: accumulating the feature maps of step S100 along the channel dimension, retaining the largest connected component and performing threshold binarization to remove background interference and keep the target key region, thereby obtaining target-level features and calculating the target-level feature class loss;
step S400: setting a class label for each pixel of the feature maps of step S100, and calculating and accumulating the class losses of all pixels to locate the target key region more finely, thereby obtaining pixel-level features and calculating the pixel-level feature class loss;
step S500: combining the sample-level, target-level and pixel-level feature class losses to jointly constrain the feature extraction network;
step S600: extracting media features with the feature extraction network, measuring the similarity between features of different media, and ranking them by similarity to realize retrieval.
2. The fine-grained cross-media retrieval method based on multi-scale feature union as claimed in claim 1, characterized in that in step S100 a ResNet-50 network is used to extract a group of 2048 × 14 × 14 feature maps S, recorded as S = {S_i}, i = 1, 2, ..., N.
3. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 2, wherein the step S200 specifically comprises the following steps:
the feature maps S are input into the global average pooling layer to obtain the 2048-dimensional sample-level feature X, whose n-th component is the spatial mean of the n-th feature map:
X_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} S_n(i, j), n = 1, 2, ..., N;
the 2048-dimensional sample-level feature X is then passed through a 2048 × 200 fully connected layer to obtain the 200-dimensional class score Z, and Z is converted into the fine-grained class probability p by the Softmax function:
p_c = exp(Z_c) / Σ_{j=1}^{C} exp(Z_j), c = 1, 2, ..., C;
finally, the sample-level feature class loss L_sample is constructed with the sample class label y:
L_sample = Σ_{k=1}^{n_b} [ l(p_k^I, y_k^I) + l(p_k^T, y_k^T) + l(p_k^A, y_k^A) + l(p_k^V, y_k^V) ]
wherein:
n_b denotes the number of samples in a batch,
I, T, A and V denote the image, text, audio and video media types respectively,
p_k^I and y_k^I are the fine-grained class probability and the class label of the k-th image sample,
p_k^T and y_k^T are the fine-grained class probability and the class label of the k-th text sample,
p_k^A and y_k^A are the fine-grained class probability and the class label of the k-th audio sample,
p_k^V and y_k^V are the fine-grained class probability and the class label of the k-th video sample,
k is the sample index,
and l(p, y) is the cross-entropy loss function:
l(p, y) = − Σ_{c=1}^{C} y_c log(p_c)
wherein C is the total number of categories.
4. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 2, wherein the step S300 specifically comprises the following steps:
firstly, the feature maps S are accumulated along the channel dimension to obtain the original activation map A:
A(i, j) = Σ_{n=1}^{N} S_n(i, j);
then only the largest connected component of the original activation map A is retained, which yields the de-noised activation map Ã;
the de-noised activation map Ã is then binarized with a threshold based on its response mean to obtain the target mask M:
M(i, j) = 1 if Ã(i, j) > a · mean(Ã), and M(i, j) = 0 otherwise,
wherein:
a is the response threshold,
Ã(i, j) is the value of the de-noised activation map at position (i, j),
M(i, j) is the target mask at position (i, j);
finally, the feature maps S are multiplied position by position with the target mask M and the result is input into the global average pooling layer to obtain the target-level feature X':
X'_n = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} S_n(i, j) · M(i, j), n = 1, 2, ..., N;
substituting X' into the classification branch in place of X yields the target-level feature class loss L_target:
L_target = Σ_{k=1}^{n_b} [ l(p_k^I, y_k^I) + l(p_k^T, y_k^T) + l(p_k^A, y_k^A) + l(p_k^V, y_k^V) ]
wherein:
Z is the class score of the target-level feature,
p is the fine-grained class probability computed from the target-level feature,
n_b denotes the number of samples in a batch,
I, T, A and V denote the image, text, audio and video media types respectively,
p_k^I and y_k^I, p_k^T and y_k^T, p_k^A and y_k^A, p_k^V and y_k^V are the fine-grained class probability and the class label of the k-th image, text, audio and video sample respectively,
k is the sample index,
and l(p, y) is the cross-entropy loss function:
l(p, y) = − Σ_{c=1}^{C} y_c log(p_c)
wherein C is the total number of categories.
5. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 4, wherein the step S400 specifically comprises the following steps:
firstly, since the class labels of the dataset are only sample-level labels, a pixel-level auxiliary label ỹ(i, j) must be generated for each position (i, j) of the feature map S:
ỹ(i, j) = y_k for target pixels (M(i, j) = 1), and ỹ(i, j) = C + 1 (the background class) for background pixels (M(i, j) = 0),
wherein:
k is the sample index and y_k is the class label of the k-th sample,
C is the total number of categories;
the numerical class representation of each pixel is then converted into a (C + 1)-dimensional one-hot vector y, whose m-th component is
y_m = 1 if m equals the pixel's class, and y_m = 0 otherwise,
wherein m indexes the C + 1 classes;
the feature maps S are input into a convolution layer with a 1 × 1 kernel, N input channels and C + 1 output channels, which yields the (C + 1) × H × W pixel-level features P;
the class prediction score of each pixel of P is then converted into a class probability by the Softmax function:
p(i, j)_m = exp(P_m(i, j)) / Σ_{m'=1}^{C+1} exp(P_{m'}(i, j));
at this point, the fine-grained class loss l(i, j) of each pixel is calculated as:
l(i, j) = − Σ_{m=1}^{C+1} y(i, j)_m log(p(i, j)_m);
the fine-grained class losses of the target pixels and of the background pixels are accumulated separately into L_tgt and L_bg:
L_tgt = Σ_{(i, j): M(i, j) = 1} l(i, j),  L_bg = Σ_{(i, j): M(i, j) = 0} l(i, j),
wherein:
n_tgt is the number of target pixels,
n_bg is the number of background pixels;
the final pixel-level feature class loss L_pixel is obtained by linearly combining L_tgt and L_bg according to the pixel proportions:
L_pixel = (n_bg / (n_tgt + n_bg)) · L_tgt + (n_tgt / (n_tgt + n_bg)) · L_bg.
6. The fine-grained cross-media retrieval method based on multi-scale feature union of claim 1, wherein the loss function of the feature extraction network in step S500 is the sum of the sample-level feature class loss, the target-level feature class loss and the pixel-level feature class loss:
L = L_sample + L_target + L_pixel
wherein:
L_sample is the sample-level feature class loss,
L_target is the target-level feature class loss,
L_pixel is the pixel-level feature class loss.
7. The fine-grained cross-media retrieval method based on multi-scale feature union according to claim 6, wherein the sample-level feature class loss, the target-level feature class loss and the pixel-level feature class loss are all obtained based on a feature map, and all use a cross entropy loss function to constrain class probability.
CN202111258804.8A 2021-10-28 2021-10-28 Fine-grained cross-media retrieval method based on multi-scale feature union Active CN113704537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111258804.8A CN113704537B (en) 2021-10-28 2021-10-28 Fine-grained cross-media retrieval method based on multi-scale feature union

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111258804.8A CN113704537B (en) 2021-10-28 2021-10-28 Fine-grained cross-media retrieval method based on multi-scale feature union

Publications (2)

Publication Number Publication Date
CN113704537A true CN113704537A (en) 2021-11-26
CN113704537B CN113704537B (en) 2022-02-15

Family

ID=78647132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111258804.8A Active CN113704537B (en) 2021-10-28 2021-10-28 Fine-grained cross-media retrieval method based on multi-scale feature union

Country Status (1)

Country Link
CN (1) CN113704537B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210256365A1 (en) * 2017-04-10 2021-08-19 Peking University Shenzhen Graduate School Cross-media retrieval method based on deep semantic space
US10176202B1 (en) * 2018-03-06 2019-01-08 Xanadu Big Data, Llc Methods and systems for content-based image retrieval
CN109800629A (en) * 2018-12-05 2019-05-24 天津大学 A kind of Remote Sensing Target detection method based on convolutional neural networks
CN111104538A (en) * 2019-12-06 2020-05-05 深圳久凌软件技术有限公司 Fine-grained vehicle image retrieval method and device based on multi-scale constraint
CN113064959A (en) * 2020-01-02 2021-07-02 南京邮电大学 Cross-modal retrieval method based on deep self-supervision sorting Hash
CN111782833A (en) * 2020-06-09 2020-10-16 南京理工大学 Fine-grained cross-media retrieval method based on multi-model network
CN112668494A (en) * 2020-12-31 2021-04-16 西安电子科技大学 Small sample change detection method based on multi-scale feature extraction
CN112800249A (en) * 2021-02-01 2021-05-14 南京理工大学 Fine-grained cross-media retrieval method based on generation of countermeasure network
CN113270199A (en) * 2021-04-30 2021-08-17 贵州师范大学 Medical cross-modal multi-scale fusion class guidance hash method and system thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DONG JIANFENG: "Cross-media Relevance Computation for Multimedia Retrieval", ACM Digital Library *
ZHOU JUXIANG: "Research on Feature Representation and Similarity Measurement Methods in Image Retrieval", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN113704537B (en) 2022-02-15

Similar Documents

Publication Publication Date Title
CN109919108B (en) Remote sensing image rapid target detection method based on deep hash auxiliary network
CN108492272B (en) Cardiovascular vulnerable plaque identification method and system based on attention model and multitask neural network
Liu et al. Star-net: a spatial attention residue network for scene text recognition.
CN109886121B (en) Human face key point positioning method for shielding robustness
US8111923B2 (en) System and method for object class localization and semantic class based image segmentation
CN110807422A (en) Natural scene text detection method based on deep learning
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
Roy et al. Bayesian classifier for multi-oriented video text recognition system
CN111339975B (en) Target detection, identification and tracking method based on central scale prediction and twin neural network
Li et al. Category dictionary guided unsupervised domain adaptation for object detection
CN111783576A (en) Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
CN112800249A (en) Fine-grained cross-media retrieval method based on generation of countermeasure network
CN111738090A (en) Pedestrian re-recognition model training method and device and pedestrian re-recognition method and device
CN112017192A (en) Glandular cell image segmentation method and system based on improved U-Net network
Wang et al. Multi-scale fish segmentation refinement and missing shape recovery
CN111553349A (en) Scene text positioning and identifying method based on full convolution network
Makhmudov et al. Improvement of the end-to-end scene text recognition method for “text-to-speech” conversion
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN114510594A (en) Traditional pattern subgraph retrieval method based on self-attention mechanism
Srihari et al. Forensic handwritten document retrieval system
CN111815582A (en) Two-dimensional code area detection method for improving background prior and foreground prior
CN114882204A (en) Automatic ship name recognition method
CN113657225A (en) Target detection method
CN113704537B (en) Fine-grained cross-media retrieval method based on multi-scale feature union
CN111210433B (en) Markov field remote sensing image segmentation method based on anisotropic potential function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant