CN115630178A - Cross-media retrieval method based on channel fine-grained semantic features


Info

Publication number
CN115630178A
Authority
CN
China
Prior art keywords
media, fine-grained, cross, feature
Legal status
Pending
Application number
CN202211417363.6A
Other languages
Chinese (zh)
Inventor
姚亚洲
沈复民
孙泽人
陈涛
白泞玮
Current Assignee
Nanjing Code Geek Technology Co ltd
Original Assignee
Nanjing Code Geek Technology Co ltd
Application filed by Nanjing Code Geek Technology Co ltd
Priority to CN202211417363.6A
Publication of CN115630178A

Classifications

    • G06F 16/483 - Information retrieval of multimedia data; retrieval characterised by using metadata automatically derived from the content
    • G06F 16/45 - Information retrieval of multimedia data; clustering; classification
    • G06N 3/08 - Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/44 - Image or video recognition or understanding; local feature extraction by analysis of parts of the pattern, e.g. edges, contours, corners; connectivity analysis
    • G06V 10/764 - Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V 10/774 - Image or video recognition or understanding; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 - Image or video recognition or understanding using neural networks


Abstract

The invention discloses a cross-media retrieval method based on channel fine-grained semantic features, comprising the following steps: S1, generate a feature map with rich channel information through a deep network; S2, divide the feature map by channels and input the groups into a fine-grained learning layer and a cross-media learning layer; S3, add the results of the fine-grained loss function and the cross-media loss function to obtain the cross-media joint loss. The method groups channels to represent each local key region, uses a global loss to generate different local key regions, uses a local loss to learn the fine-grained semantic features of each local key region, and uses a cross-media loss to measure the correlation among different media data. Compared with traditional cross-media retrieval methods, the method can simultaneously learn the fine-grained semantic features of different media for cross-media retrieval and avoids the high training cost of designing a dedicated network for each type of media data.

Description

Cross-media retrieval method based on channel fine-grained semantic features
Technical Field
The invention relates to the technical field of cross-media retrieval, in particular to a cross-media retrieval method based on channel fine-grained semantic features.
Background
In the past few years, unsupervised fine-grained feature extraction methods have been widely studied. They aim to extract discriminative local key regions from a feature map and then train end to end by learning the relationship between different local key regions within the same input and the differences between local key regions across inputs. During training the model is typically divided into two sub-networks: the first generates local key regions, and the second learns fine-grained semantic features between those regions.
Although a fine-grained feature extraction network built on local key regions only needs image-level labels, its training procedure is close to supervised learning and demands high model complexity and training effort. Consequently, using these methods to extract fine-grained features of different media and to learn cross-media correlation would incur impractical training time and model complexity.
With the development of CNNs, researchers can make features of the same class compact and features of different classes dispersed simply by designing a task-specific loss function that accounts for the characteristic of fine-grained data sets that intra-class variance is large while inter-class variance is small. For example, the paper 'A Discriminative Feature Learning Approach for Deep Face Recognition' at the 2016 European Conference on Computer Vision proposed the center loss, which sets a center point for each category, updates its position at every iteration, and measures the distance of each feature from the center of its category; this effectively pulls features of the same category together. Although such methods require no complex network structure and obtain fine-grained discriminative information purely by optimizing the loss function, they are very sensitive to noisy training data because no local key regions of the target are extracted. Since data of the same category in a fine-grained cross-media data set spans four different media, namely images, videos, audios and texts, directly learning fine-grained semantic features without extracting local key regions of the media data easily exposes the model to the noise of the different media, leading to slow convergence or even divergence.
In contrast, the paper 'The Devil is in the Channels' in the IEEE Transactions on Image Processing journal in 2020 studied the correlation between fine-grained local regions on the channels of the feature map: the channels are divided uniformly into groups, and each group of channel features represents a class, which is then used for fine-grained image classification.
Disclosure of Invention
Inspired by this research, the invention provides a cross-media retrieval method based on channel fine-grained semantic features, CFSFL (channel fine-grained semantic feature learning), which generates local key regions to learn fine-grained semantic representations and the cross-media correlation of different media features.
In order to achieve the purpose, the invention provides the following technical scheme: a cross-media retrieval method based on channel fine-grained semantic features comprises the following steps:
s1, firstly, generating a characteristic diagram with rich channel information through a deep network;
s2, dividing according to channels and inputting the divided parts into a fine-grained learning layer and a cross-media learning layer;
s21, inputting four media data, namely image data, video data, audio data and text data, into a fine-grained learning layer respectively to learn fine-grained distinguishing characteristics, and outputting the fine-grained distinguishing characteristics as fine-grained loss;
s22, jointly inputting the four media data in the S21 to a cross-media learning layer to learn cross-media correlation, and outputting cross-media loss;
and S3, adding the results of the fine-grained loss function and the cross-media loss function to obtain the cross-media joint loss.
Further, in S2, in the fine-grained cross-media retrieval task the input data contain four media, namely image, audio, video and text; training adopts a mixed multimedia input method in which the different media data are sampled equally and input jointly, so the network input is $D=\{(x_i, y_i)\}$, where each $x_i$ comes from one of the four media types, namely images, video, audio and text, and $y_i$ denotes its label; a unified network extracts the feature map of the high-dimensional channels of each media type, and the output is a feature $F\in\mathbb{R}^{c\times h\times w}$, where $c\times h\times w$ is the feature size, c is the number of channels, h is the length of the feature map, and w is the width of the feature map.
Furthermore, in the output features of the feature extractor, the four different media data are divided by channels, and each group of channels represents a different feature region with fine-grained discriminability; the channels of the four media data are equally divided into n groups, so the feature size of each group is $\frac{c}{n}\times h\times w$. Within each group of $\frac{c}{n}$ channels, a certain number of channels is randomly selected and all feature vectors of their $h\times w$ feature maps are set to zero, and the order (front and back positions) of the $\frac{c}{n}$ feature maps within every group is shuffled. The grouped feature maps are measured through the cross-media joint loss, a loss function composed of a fine-grained loss and a cross-media loss; the cross-media joint loss is defined as follows:

$$L = L_{fg}^{img} + L_{fg}^{vid} + L_{fg}^{aud} + L_{fg}^{txt} + \lambda L_{cm}$$

where $L_{fg}$ is the fine-grained loss of each media type (image data, video data, audio data and text data), $L_{cm}$ is the cross-media loss, and $\lambda$ is a hyper-parameter that controls the degree of influence of the cross-media loss.
Further, in S21, the fine-grained semantic features of the four media data, namely image, video, audio and text, are extracted, and the loss is measured by learning the global and local relations between fine-grained local key regions; the fine-grained loss of the media data is defined as follows:

$$L_{fg} = L_{local} + m\,L_{global}$$

where $L_{local}$ denotes the fine-grained local loss, $L_{global}$ denotes the fine-grained global loss, and m denotes the weight of the global loss term.
Further, channel average pooling and channel maximum pooling are performed on all feature maps within each group of channels. The channel average pooling layer adds the $\frac{c}{n}\times h\times w$ feature maps of each group together by position and then divides by $\frac{c}{n}$, so the output of each group is of size $1\times h\times w$. The channel maximum pooling layer takes the position-wise maximum of the $\frac{c}{n}\times h\times w$ feature maps of each group, and the output of each group is likewise of size $1\times h\times w$. Applying channel average pooling and channel maximum pooling to all groups yields the feature representations of all local key regions, and the two outputs are added by position, so the total output feature map is of size $n\times h\times w$. This feature map is then input into a global average pooling layer to extract the semantic representation of each local key region, with output feature size $n\times 1\times 1$: the global average pooling layer adds all feature points of each $1\times h\times w$ feature map and divides by $h\times w$ to obtain the semantic feature of that map, reducing each map to a $1\times 1$ output. Local losses are computed separately for the n local key region features, and the fine-grained local loss is defined as follows:

$$L_{local} = -\sum_{i=1}^{n} y_i \log p_i$$

where the samples may be images, video, audio or text, $y_i$ is a label, and $p_i$ is a probability feature.
Further, the global representation of the feature map is learned through a global loss. The $c\times h\times w$ feature map, after being grouped by channels, is first input into a softmax function that computes a probability over all feature points of every feature map, so the output expresses the weight of each feature point within its map and keeps the size $c\times h\times w$. To obtain the most representative feature map of each local feature, a channel maximum pooling layer fuses the feature information of the $\frac{c}{n}$ feature maps in each group into a single map by taking the position-wise maximum over all maps in the same group, so the output of each group is $1\times h\times w$. Performing channel maximum pooling on all local key regions yields the n most representative feature maps; finally, the correlation between these n regions is computed through the global loss, which is defined as follows:

$$L_{global} = -\frac{1}{n}\sum_{j=1}^{n}\sum_{x\in h\times w}\max_{k\in\,\mathrm{group}\,j}\big[\mathrm{softmax}(F_k)\big](x)$$

where the features may come from images, videos, audios or texts, n is the number of local regions, h is the length of the feature map, w is the width of the feature map, and x is each feature point on the feature map.
Further, in S22, for an input of size $c\times h\times w$, the feature representation of the media data on each channel is extracted through a global average pooling layer, giving an output of size $c\times 1\times 1$; the difference between these media data is measured by the cross-media loss, which is defined as follows:

$$L_{cm} = \frac{1}{2}\sum_{i=1}^{N}\left\lVert x_i - c_{y_i}\right\rVert_2^2$$

where the samples come from images, video, audio and text, $x_i$ represents the i-th input sample, and $c_{y_i}$ represents the category center of the i-th sample.
Furthermore, in S4, the performance of the cross-media retrieval method based on channel fine-grained semantic features is examined in experiments with the mAP score, which is computed as follows:
A confusion matrix is used to count the numbers of retrieval results TP, FN, FP and TN, where TP means the data label is a positive sample and the retrieval result is a positive sample; FN means the data label is a positive sample and the retrieval result is a negative sample; FP means the data label is a negative sample and the retrieval result is a positive sample; and TN means the data label is a negative sample and the retrieval result is a negative sample.
First, the precision P is computed from the confusion matrix:

$$P = \frac{TP}{TP + FP}$$

The degree of missed detection of the model is measured through the recall of cross-media retrieval; the recall R is computed from the confusion matrix as:

$$R = \frac{TP}{TP + FN}$$

From these, the average retrieval precision AP can be computed:

$$AP = \frac{\sum_{k=1}^{N} P(k)\,\mathrm{rel}(k)}{\sum_{k=1}^{N}\mathrm{rel}(k)}$$

where $P(k)$, the precision over the top k results, traces the precision P as a function of the recall R, N is the total number of retrieved features, and rel takes the value 0 or 1: $\mathrm{rel}(k)=1$ when the category of the input feature is the same as that of the retrieval result, and $\mathrm{rel}(k)=0$ when the categories differ. Finally, the mean of the AP values is taken as the mAP: the average retrieval precision of each class is computed and then averaged over all classes,

$$mAP = \frac{1}{Q}\sum_{q=1}^{Q} AP_q$$

where Q is the number of retrievals.
Compared with the prior art, the invention has the following beneficial effects. The method groups channels to represent each local key region, uses a global loss to generate different local key regions, then uses a local loss to learn the fine-grained semantic features of each local key region, and finally uses a cross-media loss to measure the correlation among different media data. Compared with traditional fine-grained feature learning methods, the method is simpler, more convenient and more flexible: it automatically generates the required local key regions and avoids the high computational complexity of designing a dedicated local-region localization network. Meanwhile, compared with traditional cross-media retrieval methods, the method can simultaneously learn the fine-grained semantic features of different media for cross-media retrieval and avoids the high training cost of designing a dedicated network for each type of media data. A large number of experiments and ablation studies verify the effectiveness of the method of the invention.
Drawings
FIG. 1 is a schematic diagram of a fine-grained cross-media retrieval network structure according to the present invention;
FIG. 2 is a schematic diagram of a local key area of the present invention;
FIG. 3 is a schematic diagram of fine-grained learning according to the present invention;
FIG. 4 is a schematic diagram of a confusion matrix according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 2, fig. 3 and fig. 4, the present invention is a cross-media retrieval method based on channel fine-grained semantic features, which includes the following steps:
s1, firstly, generating a feature map with rich channel information through a deep network;
s2, dividing according to channels and inputting the divided parts into a fine-grained learning layer and a cross-media learning layer;
s21, inputting four media data, namely image data, video data, audio data and text data, into a fine-grained learning layer respectively to learn fine-grained distinguishing characteristics, and outputting the fine-grained distinguishing characteristics as fine-grained loss;
s22, jointly inputting the four media data in the S21 to a cross-media learning layer to learn cross-media correlation, and outputting cross-media loss;
and S3, adding the results of the fine-grained loss function and the cross-media loss function to obtain cross-media joint loss.
In this embodiment, in S2, in the fine-grained cross-media retrieval task the input data contain four media, namely image, audio, video and text. To facilitate the learning of cross-media correlation, training adopts a mixed multimedia input method: the different media data are sampled equally and input jointly, so the network input is $D=\{(x_i, y_i)\}$, where each $x_i$ comes from one of the four media types, namely images, video, audio and text, and $y_i$ denotes its label. To reduce the number of model parameters, the application adopts a single shared Resnet50 network (without the final average pooling layer and the fully connected layer) to extract the feature maps of the high-dimensional channels of the four media types; because the network contains no linear layer, its output retains rich semantic information unique to each medium. The output is a feature $F\in\mathbb{R}^{c\times h\times w}$, where $c\times h\times w$ is the feature size, c is the number of channels, h is the length of the feature map, and w is the width of the feature map.
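As an illustration of this feature extraction step, the following sketch (an assumption, not the patent's own code; class and variable names are hypothetical) builds such a shared ResNet-50 backbone with the final average pooling and fully connected layers removed, so that any preprocessed media tensor yields a 2048-channel feature map.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SharedBackbone(nn.Module):
    """Unified feature extractor shared by image, video, audio and text inputs.

    The final average pooling and fully connected layers of ResNet-50 are
    dropped, so the output is a c x h x w feature map (c = 2048; for an
    assumed 224 x 224 input, h = w = 7).
    """
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)  # ImageNet-pretrained weights can be loaded here
        # Keep everything up to and including the last residual block.
        self.features = nn.Sequential(*list(net.children())[:-2])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, H, W) -- all media are assumed preprocessed to this shape.
        return self.features(x)  # (batch, 2048, H/32, W/32)

if __name__ == "__main__":
    backbone = SharedBackbone()
    mixed_batch = torch.randn(8, 3, 224, 224)  # e.g. a few samples of each media type
    print(backbone(mixed_batch).shape)         # torch.Size([8, 2048, 7, 7])
```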
Then, within the output features of the feature extractor, the four different media data are divided by channels, and each group of channels represents a different feature region with fine-grained discriminability: for image data a local key region, for text data a keyword vector in the text, for video a local key region of the current frame, and for audio data a discriminative feature in the spectrogram. This channel-division scheme avoids the computational complexity of designing a dedicated local-region feature extraction network.
As shown in fig. 2, different channels of the features correspond to local regions of different targets. The channels of the four media data are equally divided into n groups, each group having feature size $\frac{c}{n}\times h\times w$, and the channels left over by the grouping are all zeroed. To reduce the amount of computation and increase the generalization capability of the model, a dropout operation is then performed on the grouped features.
In the experiments, within each group of $\frac{c}{n}$ channels a certain number of channels is randomly selected and all feature vectors of their $h\times w$ feature maps are set to zero, and the order (front and back positions) of the $\frac{c}{n}$ feature maps within every group is shuffled. Unlike approaches that compute weights for all channels in each group, the method of the invention randomly discards some channels during training, which forces all feature maps in the same group to try to learn the local key region information; the loss is then measured after fitting all feature maps in the same group of channels, so that sufficient local key features are obtained.
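A minimal sketch of this grouping step is given below, assuming a 2048-channel backbone output split into n groups with random in-group channel dropping and shuffling during training; the function name, defaults and tensor layout are illustrative, not the patent's code.

```python
import torch

def group_channels(feat: torch.Tensor, n_groups: int = 200,
                   drop_per_group: int = 3, training: bool = True) -> torch.Tensor:
    """Channel grouping with random in-group channel dropping and shuffling.

    feat: (batch, c, h, w) backbone output. Channels that do not fit into
    n_groups equal groups are zeroed; during training a few channels inside
    every group are randomly zeroed and the channel order of the group is
    shuffled. Defaults mirror the numbers reported later in the text.
    """
    b, c, h, w = feat.shape
    per_group = c // n_groups                  # e.g. 2048 // 200 = 10
    usable = per_group * n_groups              # e.g. 2000; the leftover channels are zeroed
    feat = feat.clone()
    feat[:, usable:] = 0
    groups = feat[:, :usable].reshape(b, n_groups, per_group, h, w)
    if training:
        for g in range(n_groups):
            drop = torch.randperm(per_group)[:drop_per_group]
            groups[:, g, drop] = 0             # channel dropout inside the group
            perm = torch.randperm(per_group)
            groups[:, g] = groups[:, g, perm]  # shuffle the channel order in the group
    return groups                              # (batch, n_groups, per_group, h, w)
```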
Finally, the grouped feature maps are measured through the cross-media joint loss proposed by the invention, a loss function composed of a fine-grained loss and a cross-media loss. The fine-grained loss learns the differences between these local key features to produce fine-grained discriminative information, and it can also learn the associations between the local key features so as to generate the local key regions. The cross-media loss learns cross-media correlation by measuring the differences between global features. The cross-media joint loss is defined as follows:

$$L = L_{fg}^{img} + L_{fg}^{vid} + L_{fg}^{aud} + L_{fg}^{txt} + \lambda L_{cm}$$

where $L_{fg}$ is the fine-grained loss of each media type (image data, video data, audio data and text data), $L_{cm}$ is the cross-media loss, and $\lambda$ is a hyper-parameter that controls the degree of influence of the cross-media loss.
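For clarity, the joint loss above can be assembled as in the short sketch below (a sketch only; the dictionary layout and the default value of the hyper-parameter lambda are assumptions).

```python
import torch

def joint_loss(fg_losses: dict, cm_loss: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Cross-media joint loss: sum of the four per-media fine-grained losses
    plus a lambda-weighted cross-media loss."""
    return sum(fg_losses[m] for m in ("img", "vid", "aud", "txt")) + lam * cm_loss
```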
In this embodiment, as shown in fig. 3, the main flow of the proposed fine-grained learning is as follows: the method extracts the fine-grained semantic features of the four media data, namely images, videos, audios and texts, and measures the loss by learning the global and local relations between fine-grained local key regions. The total fine-grained loss is defined as follows:

$$L_{fg} = L_{local} + m\,L_{global}$$

where $L_{local}$ denotes the fine-grained local loss, $L_{global}$ denotes the fine-grained global loss, and m denotes the weight of the global loss term.
The method divides the whole feature map into n groups to extract the fine-grained semantic representations of n local key regions separately, with the feature maps in each group of channels representing one local key region. Taking a $c\times h\times w$ feature map as an example, after grouping by channels every $\frac{c}{n}\times h\times w$ block of channel feature maps represents one local key region, where c is the number of channels, h is the length of the feature map, w is the width of the feature map, and n is the number of local key regions to be learned (usually n = 200, 200 being the number of fine-grained classes).
In order to effectively extract the fine-grained semantic features in each local key region, channel average pooling and channel maximum pooling are performed on all feature maps within each group of channels. The channel average pooling layer adds the $\frac{c}{n}\times h\times w$ feature maps of each group together by position and then divides by $\frac{c}{n}$, averaging all feature-map information within a group of channels; the output of each group is of size $1\times h\times w$. The channel maximum pooling layer takes the position-wise maximum of the $\frac{c}{n}\times h\times w$ feature maps of each group to obtain their peak information; the output of each group is likewise of size $1\times h\times w$. Applying channel average pooling and channel maximum pooling to all groups yields the feature representations of all local key regions, and the two outputs are added by position, so the total output feature map is of size $n\times h\times w$. This feature map is then input into a global average pooling layer to extract the semantic representation of each local key region, with output feature size $n\times 1\times 1$: the global average pooling layer adds all feature points of each $1\times h\times w$ feature map and divides by $h\times w$ to obtain the semantic feature of that map, reducing each map to a $1\times 1$ output. Local losses are computed separately for the n local key region features, and the fine-grained local loss is defined as follows:

$$L_{local} = -\sum_{i=1}^{n} y_i \log p_i$$

where the samples may be images, video, audio or text, $y_i$ is a label, and $p_i$ is a probability feature. Taking the image media as an example, $y_i$ represents the fine-grained label of the i-th local region of the image, and $\log p_i$ represents the log-likelihood probability of the i-th local region of the image after the global average pooling layer. By constraining every $1\times h\times w$ local key region feature, this loss helps the model extract more discriminative fine-grained semantic features within each region.
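The pooling pipeline and local loss just described can be sketched as follows; treating the n pooled region scores as logits for the n fine-grained subcategories and applying a cross-entropy is an assumed concrete reading of the log-likelihood loss above, and the tensor layout matches the hypothetical group_channels helper sketched earlier.

```python
import torch
import torch.nn.functional as F

def local_fg_loss(groups: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Fine-grained local loss (sketch).

    groups: (batch, n, c/n, h, w) grouped feature maps.
    labels: (batch,) fine-grained subcategory labels in [0, n).
    """
    avg = groups.mean(dim=2)                # channel average pooling -> (batch, n, h, w)
    mx = groups.max(dim=2).values           # channel max pooling     -> (batch, n, h, w)
    fused = avg + mx                        # position-wise sum, n x h x w per sample
    region_scores = fused.mean(dim=(2, 3))  # global average pooling  -> (batch, n)
    return F.cross_entropy(region_scores, labels)
```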
Meanwhile, in order to encourage the model to discover different local regions rather than letting all channels focus on only one most critical local region, the global representation of the feature map is learned through the global loss. The $c\times h\times w$ feature map, after being grouped by channels, is first input into a softmax function that computes a probability over all feature points of every feature map, so the output expresses the weight of each feature point within its map and keeps the size $c\times h\times w$. To obtain the most representative feature map of each local feature, a channel maximum pooling layer fuses the feature information of the $\frac{c}{n}$ feature maps in each group into a single map by taking the position-wise maximum over all maps in the same group, so the output of each group is $1\times h\times w$. Performing channel maximum pooling on all local key regions yields the n most representative feature maps; finally, the correlation between these n regions is computed through the global loss, which is defined as follows:

$$L_{global} = -\frac{1}{n}\sum_{j=1}^{n}\sum_{x\in h\times w}\max_{k\in\,\mathrm{group}\,j}\big[\mathrm{softmax}(F_k)\big](x)$$

where the features may come from images, videos, audios or texts, n is the number of local regions, h is the length of the feature map, w is the width of the feature map, and x is each feature point on the feature map. Taking the image media as an example, $x_i$ denotes the feature point at the i-th position of the image feature. When the channel features within a group are all different from one another, the magnitude of this term reaches its upper limit. By optimizing this term within the total fine-grained loss $L_{fg}$, the model is helped to find different local regions, which avoids the information redundancy caused by all channel features learning only one local region.
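A sketch of this global (diversity) term follows. It implements the pipeline described above, a spatial softmax followed by channel max pooling and a spatial sum per group; the averaging over groups and the negative sign (so that minimising the joint loss encourages diverse regions) are assumptions.

```python
import torch
import torch.nn.functional as F

def global_fg_loss(groups: torch.Tensor) -> torch.Tensor:
    """Fine-grained global loss (sketch).

    groups: (batch, n, c/n, h, w) grouped feature maps.
    """
    b, n, k, h, w = groups.shape
    weights = F.softmax(groups.reshape(b, n, k, h * w), dim=-1)  # softmax over feature points
    peak = weights.max(dim=2).values  # channel max pooling -> (batch, n, h*w)
    diversity = peak.sum(dim=-1)      # largest when channels of a group attend to different points
    return -diversity.mean()
```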
In this embodiment, in addition to learning fine-grained semantic features in each media using fine-grained loss, cross-media correlation between different media needs to be measured. Since data of different media types are distributed in the feature space by media type, a loss function is needed to cluster the media data together in fine-grained subcategories. The present invention uses center loss as a measure of cross-media correlation because it can set a category center for each fine-grained subcategory and then reduce the impact of the "media gap" by narrowing the distance of the data for different media types from this center.
For an input feature map of size $c\times h\times w$, the feature representation of the media data on each channel is first extracted through a global average pooling layer, giving an output of size $c\times 1\times 1$. These media features are then mapped into a high-level semantic space by a linear layer that outputs a 200-dimensional feature representation (200 being the number of fine-grained subcategories), since the objective of the cross-media loss is to partition the data by fine-grained subcategory. Finally, the difference between the media data is measured by the cross-media loss, defined as follows:

$$L_{cm} = \frac{1}{2}\sum_{i=1}^{N}\left\lVert x_i - c_{y_i}\right\rVert_2^2$$

where the samples come from images, video, audio and text, $x_i$ denotes the i-th input sample and $c_{y_i}$ denotes the category center of the i-th sample. By optimizing the cross-media loss, the distance between data of different media types and the center of their fine-grained subcategory is effectively reduced.
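Since the cross-media loss is the center loss described above, a minimal sketch is given below; making the class centres learnable parameters (updated by the optimizer rather than by a hand-written update rule) is a simplification, and the 200-dimensional feature space follows the text.

```python
import torch
import torch.nn as nn

class CrossMediaCenterLoss(nn.Module):
    """Center loss used as the cross-media loss (sketch; names illustrative).

    Features of all four media are mapped to the same 200-dimensional space,
    and every sample is pulled towards the centre of its fine-grained
    subcategory, regardless of which medium it comes from.
    """
    def __init__(self, num_classes: int = 200, feat_dim: int = 200):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim) pooled and linearly mapped media features
        # labels: (batch,) fine-grained subcategory labels shared across media
        centers = self.centers[labels]  # (batch, feat_dim)
        return 0.5 * ((feats - centers) ** 2).sum(dim=1).mean()
```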
In this embodiment, the experiment uses the mAP score to test the performance of the cross-media retrieval method of channel fine-grained semantic features, first calculating the Average Precision (AP) of each media query, and then taking the average of them as the mAP score. For the two different data sets, the invention evaluates four multi-modal fine-grained cross-media retrieval performances and twelve bi-modal fine-grained cross-media retrieval performances respectively.
The mAP is calculated as follows:
as shown in FIG. 4, the present invention uses a confusion matrix to count the number of search results TP, FN, FP, and TN, wherein each column represents the positive or negative of the prediction result, the total number of each column represents the positive or negative of the prediction, each row represents the positive or negative of the true label of the data, and the total number of each row represents the number of positive or negative labels. TP (True Positive) indicates that the data label is a Positive sample and the search result is a Positive sample. FN (False Negative) indicates that the data label is a positive sample and the search result is a Negative sample. FP (False Positive) indicates that the data label is a negative sample and the search result is a Positive sample. TN (True Negative) indicates that the data label is a Negative sample, and the search result is a Negative sample.
First, the precision P is computed from the confusion matrix:

$$P = \frac{TP}{TP + FP}$$

The degree of missed detection of the model is measured through the recall of cross-media retrieval; the recall R is computed from the confusion matrix as:

$$R = \frac{TP}{TP + FN}$$

From these, the average retrieval precision AP can be computed:

$$AP = \frac{\sum_{k=1}^{N} P(k)\,\mathrm{rel}(k)}{\sum_{k=1}^{N}\mathrm{rel}(k)}$$

where $P(k)$, the precision over the top k results, traces the precision P as a function of the recall R, N is the total number of retrieved features, and rel takes the value 0 or 1: $\mathrm{rel}(k)=1$ when the category of the input feature is the same as that of the retrieval result, and $\mathrm{rel}(k)=0$ when the categories differ. Finally, the mean of the AP values is taken as the mAP: the average retrieval precision of each class is computed and then averaged over all classes,

$$mAP = \frac{1}{Q}\sum_{q=1}^{Q} AP_q$$

where Q is the number of retrievals. The mAP metric jointly considers the precision and the recall of the model, which makes it well suited to evaluating the performance of cross-media retrieval.
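The mAP computation can be sketched as below; it averages the precision at the ranks of the relevant results for each query, following the AP formula above, and then averages the per-query AP values.

```python
import numpy as np

def average_precision(query_label: int, ranked_labels) -> float:
    """AP for one query: rel(k) = 1 when the k-th retrieved item shares the
    query's category, and P(k) is the precision over the top-k results."""
    hits, precisions = 0, []
    for k, lab in enumerate(ranked_labels, start=1):
        if lab == query_label:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(query_labels, ranked_label_lists) -> float:
    """mAP: mean of the per-query AP values over the Q retrievals."""
    aps = [average_precision(q, r) for q, r in zip(query_labels, ranked_label_lists)]
    return float(np.mean(aps))
```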
Because the invention adopts a unified network as the feature extractor for the four different media, the data of the different media must be converted to the same network input size; the feature extractor used in the experiments is the Resnet50 network structure. For a fair comparison, the four media data of image, text, audio and video are preprocessed and converted into input feature maps of the same size. To accelerate training, the experiments initialize the Resnet50 network with parameters pre-trained on ImageNet, while the proposed fine-grained learning layer and cross-media learning layer are trained from random parameters at the first iteration. The mAP score is computed in the testing phase using the features generated by the linear layer of the cross-media learning process.
In order to fully learn cross-media correlation, the batch size of a single medium is set to 8 in the experiments, and the four media are mixed and trained together (one training batch contains 32 samples: 8 image, 8 audio, 8 text and 8 video samples). Because the training sets of the different media are of unequal size, one pass over the text training data finishes before the other three media have been fully traversed; to train the different media types fairly, 4000 samples are re-sampled at random for training after each round (for the PKU-FG-Xmedia data set the text data set is the smallest, with 4000 training samples and 4000 test samples), which also helps avoid over-fitting during model training. The experiments train for 200 rounds with a cosine learning-rate schedule, a base learning rate of 0.001, momentum of 0.9 and weight decay of 0.0001.
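A sketch of this training configuration is shown below; SGD is an assumption inferred from the momentum and weight-decay settings, and the batch-mixing helper (names and loader interfaces hypothetical) only illustrates the 8-samples-per-medium mixing described above.

```python
import torch

def make_optimizer_and_scheduler(parameters, epochs: int = 200):
    """Optimizer and schedule matching the reported settings: base lr 0.001,
    momentum 0.9, weight decay 0.0001, cosine learning-rate schedule."""
    optimizer = torch.optim.SGD(parameters, lr=0.001, momentum=0.9, weight_decay=0.0001)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

def mixed_batch(img_iter, vid_iter, aud_iter, txt_iter):
    """Draw 8 samples from each media iterator and concatenate them into one
    32-sample batch, as in the training setup described above."""
    xs, ys = [], []
    for it in (img_iter, vid_iter, aud_iter, txt_iter):
        x, y = next(it)  # each loader is assumed to yield tensors of 8 samples
        xs.append(x)
        ys.append(y)
    return torch.cat(xs), torch.cat(ys)
```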
In order to fully verify the effectiveness of the method proposed by the present invention, experiments will be compared with several latest cross-media retrieval methods on the bimodal and multimodal domains, including GXN, SCAN, CMDN, GSPH, JRL, ACMR, MHTN, MGAH, CDCR, FGCrossNet.
Tables 7.1 and 7.2 show the bimodal fine-grained cross-media retrieval mAP scores. For fine-grained cross-media retrieval in the bimodal domain, the invention performs one-to-one retrieval between the four media of image, video, audio and text, covering all twelve directions: image→text, image→audio, image→video, text→image, text→audio, text→video, audio→image, audio→text, audio→video, video→image, video→text and video→audio, where image→text means that the input data is an image feature and the retrieval result is a text feature. Table 7.3 shows the fine-grained cross-media retrieval mAP scores for the multi-modal domain. For fine-grained cross-media retrieval in the multi-modal domain, the invention performs one-to-many retrieval on the four media, namely image→all, video→all, audio→all and text→all, where image→all means that the input data is an image feature and the retrieval result contains the features of all media types. Cross-media retrieval in the multi-modal domain outputs retrieval results of all media types and then sorts them by similarity.
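As an illustration of how such retrieval lists can be produced, the sketch below ranks gallery features by cosine similarity to a query feature (cosine similarity is an assumed choice of metric; for one-to-many retrieval the gallery simply concatenates the features of all media types).

```python
import torch
import torch.nn.functional as F

def retrieve(query: torch.Tensor, gallery: torch.Tensor, topk: int = 10) -> torch.Tensor:
    """Rank gallery features by cosine similarity to a query feature.

    query: (d,) feature of the input sample; gallery: (N, d) features of the
    retrieval database. For one-to-one retrieval the gallery holds a single
    other medium; for one-to-many retrieval it concatenates all media types,
    and the results are sorted by similarity as described above.
    """
    sims = F.cosine_similarity(query.unsqueeze(0), gallery)  # (N,)
    return sims.argsort(descending=True)[:topk]              # indices of the top-k results
```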
From the results in Tables 7.1 and 7.2, it can be seen that the CFSFL method proposed by the invention improves fine-grained one-to-one cross-media retrieval to some extent compared with conventional methods. Because the channel feature learning method proposed by the invention can generate local key regions for different media data and learn the fine-grained semantic features of each local region, the bimodal fine-grained cross-media retrieval mAP of the method improves by 6.7 percentage points on average. Five of the retrieval directions improve most markedly (by about 9 percentage points), because the traditional FGCrossNet method adopts only three losses (cross-entropy loss, triplet loss and ranking loss) to learn cross-media correlation and fine-grained semantic features, so the features extracted by that network contain a large amount of redundant information and cannot provide enough key features for learning the cross-media correlation among fine-grained semantic features.
TABLE 7.1 mAP results for bimodal fine-grained cross-media retrieval
TABLE 7.2 mAP results for bimodal fine-grained cross-media retrieval
However, the CFSFL method of the invention yields only a small improvement in the retrieval directions between image data and video data. Cross-entropy loss is well suited to visual features, and mapping the features to 200 dimensions aligns them well with the fine-grained subcategory labels, which already favours learning fine-grained semantic features; as a result, using the local loss for fine-grained learning does not markedly raise the mAP retrieval score between image and video. The two cross-media retrieval directions between image and video therefore improve by only 1 and 2.4 percentage points.
From the results in table 7.3, it can be seen that the CFSFL method proposed by the present invention is not only improved in the bimodal domain, but also has significant results when a one-to-many search process is performed in the multimodal domain. Compared with the most advanced method, the CFSFL method provided by the invention improves the average retrieval mAP score on four multi-modal fine-grained cross-media retrieval by about 6.1 percentage points.
TABLE 7.3 mAP results for multimodal fine-grained cross-media retrieval
In order to fully verify the effectiveness of the method, the invention carries out experiments not only on fine-grained cross-media retrieval but also comparative experiments on coarse-grained cross-media retrieval. Tables 7.4 and 7.5 give the coarse-grained cross-media retrieval results on the Xmedia data set for one-to-one retrieval between image, video, audio and text media. The cross-media retrieval score of the CFSFL method in the bimodal domain is 0.777 on average, 1.8 percentage points higher than the mainstream FGCrossNet method. This shows that the channel-feature-based learning method also brings a certain improvement on a coarse-grained data set, because the proposed fine-grained learning method can learn the local and global relations of key points at the same time, which helps the model generate more effective semantic features; however, on some retrieval directions the performance is not as strong as that of methods designed specifically for coarse-grained data sets.
TABLE 7.4 mAP results for bimodal coarse-grained cross-media retrieval
TABLE 7.5 mAP results for bimodal coarse-grained cross-media retrieval
Table 7.6 shows coarse-grained cross-media retrieval mAP results for multi-modal domains. From the table, it can be found that the cross-media retrieval method based on multi-modal domain is improved to some extent in each one-to-many retrieval process. This is because the CFSFL method proposed by the present invention can learn cross-media correlations from key points in each media, which can effectively reduce noise effects in different media data. However, since the coarse-grained data sets have obvious inter-class differences, and the method of the present invention is a method specially designed for fine-grained data sets, the method of generating local regions and the method of learning fine-grained semantic representations of local regions have less influence on the coarse-grained data sets. Therefore, the method of the present invention has a limited increase in the coarse-grained cross-media retrieval task compared to the most advanced cross-media retrieval methods.
TABLE 7.6 mAP results for multimodal coarse-grained cross-media retrieval
In order to further prove the effectiveness of the method, ablation studies are carried out from three angles: different feature extraction networks, different channel processing strategies, and different channel feature learning methods; a large number of experiments on the fine-grained cross-media data set PKU-FG-Xmedia prove the effectiveness of the method.
In order to effectively learn fine-grained semantic representations of various media, a general and effective feature extraction network is needed to extract features of different media data for fine-grained learning and cross-media retrieval. The impact of three different deep networks on the proposed method of the present invention, including VGG16, B-CNN and Resnet50, was studied by replacing the network in fig. 1. Tables 7.7 and 7.8 show the fine-grained cross-media retrieval results of the three networks in the dual-modal domain, and table 7.9 shows the fine-grained cross-media retrieval results of the three networks in the multi-modal domain. It can be found that the Resnet50 network structure is superior to the B-CNN network structure, and the B-CNN network structure is superior to the VGG16 network structure.
TABLE 7.7 mAP results for bimodal fine-grained cross-media retrieval of different feature extraction networks
TABLE 7.8 mAP results for bimodal fine-grained cross-media retrieval of different feature extraction networks
Table 7.9 multi-modal fine-grained cross-media retrieval mAP results for different feature extraction networks
In order to fully utilize the feature information in the channels and reduce information redundancy, the CFSFL method of the invention must divide the channels and let different channel groups represent different local key regions, so the channel processing scheme is very important for local region generation and fine-grained semantic learning. An ablation study is therefore carried out on the proposed channel processing method. As shown in Table 7.10, the invention compares three different channel processing methods; to compare them fairly, the remaining parameters of the experiments are kept constant. The channel-splitting method of the invention divides the output of the last residual block of the Resnet network directly by channels: because the 2048 output channels cannot be divided evenly by the 200 classes, 48 channels are randomly zeroed, the remaining channels are divided equally so that each group contains 10 channels, and within each group 3 of the channel features are zeroed while the front and back positions of the feature maps in each group are shuffled. Method 1 applies a linear mapping to the output features of the last residual block of the Resnet network, mapping the number of channels to 2000, and then divides the channels equally. Method 2 differs from the method of the invention in that all grouped features are retained and no channel dropping is performed.
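A brief sketch of the difference between the proposed direct split and the Method 1 variant is given below (using a 1x1 convolution as the form of the linear channel mapping is an assumption, and the helper mentioned in the comments refers to the earlier hypothetical group_channels sketch).

```python
import torch
import torch.nn as nn

# Proposed strategy: split the 2048-channel ResNet output directly.
# 48 channels are zeroed, the remaining 2000 form 200 groups of 10 channels,
# and 3 channels per group are randomly dropped (see the group_channels sketch above).

# "Method 1" from Table 7.10: first map the channels with a linear (1x1 conv)
# layer to 2000, then divide them equally into 200 groups.
map_to_2000 = nn.Conv2d(2048, 2000, kernel_size=1)

feat = torch.randn(2, 2048, 7, 7)  # dummy backbone output
mapped = map_to_2000(feat)         # (2, 2000, 7, 7), ready to be split into 200 groups
print(mapped.shape)
```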
TABLE 7.10 different channel treatment methods
Tables 7.11 and 7.12 show the fine-grained cross-media retrieval results of the different channel processing methods in the bimodal domain, and Table 7.13 shows the results in the multi-modal domain. It can be seen that directly dividing the output of the last residual block of the Resnet network by channels works much better than applying a linear mapping first, because a linear mapping applied to different media data loses part of the information unique to each medium, which hinders the fine-grained loss from learning the local regions and fine-grained semantic information specific to the different media. Comparing Method 2 with the proposed CFSFL method shows that the channel drop ratio set by the invention brings a further improvement in cross-media retrieval; this indicates that a reasonable channel drop ratio encourages every channel within a group to fully learn the fine-grained feature representation of its local region.
TABLE 7.11 mAP results of different channel processing methods in bimodal fine-grained cross-media retrieval
TABLE 7.12 mAP results of different channel processing methods in bimodal fine-grained cross-media retrieval
TABLE 7.13 mAP results of different channel processing methods in multimodal fine-grained cross-media retrieval
In addition, the invention also carries out an ablation study on the proposed channel feature learning method, analysing the effectiveness of the fine-grained learning formula

$$L_{fg} = L_{local} + m\,L_{global}$$

and of the cross-media learning formula

$$L_{cm} = \frac{1}{2}\sum_{i=1}^{N}\left\lVert x_i - c_{y_i}\right\rVert_2^2$$

The fine-grained cross-media retrieval results in the bimodal domain are shown in Tables 7.14 and 7.15, and those in the multi-modal domain are shown in Table 7.16.
To study the influence of the global connection between local key regions on learning fine-grained semantic features within each medium, the global term is removed from fine-grained learning, and the loss of this variant is defined as:

$$L_{1} = \sum_{\mathrm{media}} L_{local} + \lambda L_{cm}$$

To investigate whether fine-grained cross-media retrieval can be performed simply by learning representations of fine-grained semantic features within each medium and aligning them with their fine-grained subcategory labels, the cross-media term is removed, and the loss of this variant is defined as:

$$L_{2} = \sum_{\mathrm{media}} \left( L_{local} + m\,L_{global} \right)$$

It can be found that the CFSFL method proposed by the invention achieves the best result by jointly considering the local differences and global connections between local key regions in different media as well as the correlation between different media. Comparing the two ablated variants shows that the global association of local key regions within a medium is more important than the measure of relevance between different media: when the local loss learns the discriminative features of local key regions in different media, each local region feature is aligned with its fine-grained subcategory label, which effectively encourages data of different media types to be distributed by fine-grained subcategory in the feature space.
TABLE 7.14 mAP results of different losses in bimodal fine-grained cross-media retrieval
TABLE 7.15 mAP results for different penalties in bimodal fine-grained cross-media retrieval
Table 7.16 mAP results of different losses in multimodal fine-grained cross-media retrieval
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims (8)

1. A cross-media retrieval method based on channel fine-grained semantic features is characterized by comprising the following steps:
s1, firstly, generating a feature map with rich channel information through a deep network;
s2, dividing according to channels and inputting the divided data into a fine-grained learning layer and a cross-media learning layer;
s21, respectively inputting four media data, namely image data, video data, audio data and text data, into a fine-grained learning layer to learn fine-grained discrimination characteristics, and outputting the fine-grained discrimination characteristics as fine-grained loss;
s22, jointly inputting the four media data in the S21 to a cross-media learning layer to learn cross-media correlation, and outputting cross-media loss;
s3, adding results of the fine-grained loss function and the cross-media loss function to obtain cross-media joint loss;
and S4, evaluating the performance of the retrieval method by adopting an experiment.
2. The cross-media retrieval method based on the channel fine-grained semantic features as claimed in claim 1, wherein in S2, in the fine-grained cross-media retrieval task, the input data contains four media, namely images, audio, video and text; training adopts a multimedia mixed-input method in which the different media data are equally sampled and input jointly, the network input being
\{(x_i^{I}, x_i^{V}, x_i^{A}, x_i^{T}),\, y_i\}
wherein x_i^{I}, x_i^{V}, x_i^{A} and x_i^{T} represent the images, video, audio and text respectively and y_i represents their labels; a unified network is used to extract a high-dimensional channel feature map from each kind of media data, the output feature map being
F \in \mathbb{R}^{c \times h \times w}
wherein \mathbb{R}^{c \times h \times w} denotes a feature tensor of that size, c is the number of channels, h is the length of the feature map, and w is the width of the feature map.
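A minimal sketch of producing the c × h × w channel feature map described in claim 2. The choice of a ResNet-50 backbone truncated before its pooling and classification layers, and the 224×224 input, are illustrative assumptions; the claim only requires a unified deep network, and video, audio and text would first need to be rendered into backbone-compatible inputs.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Truncate a standard ResNet-50 before global pooling and the classifier so it
# outputs a channel-rich feature map instead of a class vector (an assumption
# for illustration; the patent does not name a specific backbone).
resnet = models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])

x_img = torch.randn(2, 3, 224, 224)   # a small batch of image-like media inputs
feat = backbone(x_img)                # shape (2, c=2048, h=7, w=7)
print(feat.shape)
```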
3. The method of claim 2, wherein the four different media data are divided according to channels in the output features of the feature extractor, each channel representing a different feature region with fine-grained discriminability; the channels of the four media data are equally divided into n groups, each group having a feature size of (c/n) × h × w; within each group of c/n channels, a number of channels have all feature vectors in their h × w feature maps set to zero, and the front-and-back spatial order of the c/n channel feature maps in every group is shuffled; the grouped feature maps are measured through a cross-media joint loss, the loss function consisting of a fine-grained loss and a cross-media loss; the cross-media joint loss is defined as follows:
L = \sum_{m \in \{I, V, A, T\}} L_{fine}^{m} + \lambda \cdot L_{cross}
wherein L_{fine}^{m} is the fine-grained loss of each kind of media data, including image data, video data, audio data and text data, L_{cross} is the cross-media loss, and \lambda is a hyper-parameter controlling the degree of influence of the cross-media loss.
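A sketch of the channel grouping described in claim 3: the c channels are split into n groups, the feature maps of a randomly chosen subset of channels in each group are zeroed, the channel order within every group is shuffled, and the joint loss combines the fine-grained and cross-media terms. The function names and `drop_per_group` are hypothetical; the claim does not fix how many channels are zeroed.

```python
import torch

def group_and_perturb(feat, n_groups, drop_per_group=1):
    """Split channels into groups, zero a random subset per group, shuffle group order."""
    b, c, h, w = feat.shape
    assert c % n_groups == 0
    g = c // n_groups
    feat = feat.view(b, n_groups, g, h, w).clone()
    for k in range(n_groups):
        drop = torch.randperm(g)[:drop_per_group]   # channels whose h x w maps are zeroed
        feat[:, k, drop] = 0.0
        order = torch.randperm(g)                   # shuffle channel order inside the group
        feat[:, k] = feat[:, k, order]
    return feat                                     # shape (b, n, c/n, h, w)

def joint_loss(fine_losses, cross_loss, lam=0.5):
    # L = sum_m L_fine^m + lambda * L_cross
    return sum(fine_losses) + lam * cross_loss
```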
4. The cross-media retrieval method based on the channel fine-grained semantic features as claimed in claim 2, wherein in S21, the fine-grained semantic features of the four media data, namely image, video, audio and text, are extracted, and the global and local relations between fine-grained local key regions are learned for the loss measurement; the fine-grained loss of the media data is defined as follows:
L_{fine} = L_{local} + m \cdot L_{global}
wherein L_{local} represents the fine-grained local loss, L_{global} represents the fine-grained global loss, and m represents the weight of the global loss term.
5. The method of claim 4, wherein all feature maps in each group of channels are subjected to channel average pooling and channel maximum pooling; the channel average pooling layer sums the (c/n) × h × w feature maps of each group by position and divides by c/n, so that the output size of each group of feature maps is 1 × h × w; the channel maximum pooling layer takes the position-wise maximum over the (c/n) × h × w feature maps of each group, so that the output size of each group of feature maps is likewise 1 × h × w; the feature representations of all local key regions are obtained by performing channel average pooling and channel maximum pooling on all groups and then adding the two outputs by position, the size of the total output feature map being n × h × w; this feature map is then input into a global average pooling layer to extract the semantic representation of each local key region, the output feature size being n × 1 × 1; the global average pooling layer sums all feature points in each 1 × h × w feature map and divides by h × w to obtain the semantic feature of that map, output as a 1 × 1 value per group; the fine-grained local loss is calculated separately for the n local key region features and is defined as follows:
L_{local}^{m} = -\sum y \log(p^{m}), \quad m \in \{I, V, A, T\}
wherein the superscripts I, V, A and T respectively denote images, video, audio and text, y is the label, and p is the probabilistic feature.
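A sketch of the pooling pipeline and local loss of claim 5, assuming the cross-entropy form implied by the label/probability description. `classifier` is a hypothetical per-region head, and for brevity the loss here is computed once over the n region features rather than separately per region as the claim states.

```python
import torch
import torch.nn.functional as F

def local_region_features(grouped):
    """grouped: (b, n, c/n, h, w) channel-grouped feature maps."""
    avg = grouped.mean(dim=2)          # channel average pooling -> (b, n, h, w)
    mx, _ = grouped.max(dim=2)         # channel maximum pooling -> (b, n, h, w)
    fused = avg + mx                   # position-wise addition  -> (b, n, h, w)
    sem = fused.mean(dim=(2, 3))       # global average pooling  -> (b, n)
    return sem

def fine_grained_local_loss(sem, classifier, labels):
    # classifier: e.g. torch.nn.Linear(n, num_subcategories), a hypothetical head
    # mapping region features to fine-grained sub-category logits.
    logits = classifier(sem)
    return F.cross_entropy(logits, labels)
```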
6. The method of claim 5, wherein the global representation of the feature map is learned through the global loss; first, the channel-grouped feature map is input into a softmax function, which computes a probability over all feature maps so that the output features express the weight of every feature point in each feature map, the output having the same size as the input; then, in order to obtain the feature map most representative of each local feature, the c/n feature maps of each group are aggregated into a single feature map through a channel maximum pooling layer, which takes the position-wise maximum of all feature maps in the same group and outputs a 1 × h × w feature map per group; performing channel maximum pooling on all local key regions yields the n most representative feature maps; finally, the correlation between the n regions is computed as the fine-grained global loss, which is evaluated over all feature points of the n representative maps, wherein I, V, A and T respectively represent images, video, audio and text, n is the number of local regions, h is the length of the feature map, w is the width of the feature map, and x is each feature point on the feature map.
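A sketch of claim 6's global branch: softmax weighting of feature points, channel max pooling to keep one representative map per group, and a correlation term over the n maps. Because the source does not reproduce the global-loss formula, the pairwise-similarity penalty below is only an assumed stand-in, not the patented definition.

```python
import torch
import torch.nn.functional as F

def global_loss(grouped):
    """grouped: (b, n, c/n, h, w) channel-grouped feature maps."""
    b, n, g, h, w = grouped.shape
    # Softmax over spatial positions gives a weight to every feature point.
    weights = F.softmax(grouped.view(b, n, g, -1), dim=-1).view(b, n, g, h, w)
    weighted = grouped * weights
    # Channel max pooling keeps one representative map per local key region.
    rep, _ = weighted.max(dim=2)                       # (b, n, h, w)
    rep = F.normalize(rep.view(b, n, -1), dim=-1)      # flatten the h*w feature points
    corr = rep @ rep.transpose(1, 2)                   # (b, n, n) pairwise correlation
    # Assumed stand-in: penalize correlation between distinct regions.
    off_diag = corr - torch.diag_embed(torch.diagonal(corr, dim1=1, dim2=2))
    return off_diag.abs().mean()
```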
7. The method of claim 2, wherein in S22, the input of size c × h × w is passed through a global average pooling layer to extract the feature representation of the media data on each channel, the output size being c × 1 × 1; the difference between these media data is measured by the cross-media loss, defined as follows:
L_{cross} = \sum_{m \in \{I, V, A, T\}} \sum_{i} \lVert x_i^{m} - c_i \rVert_2^2
wherein I, V, A and T respectively represent images, video, audio and text, x_i represents the i-th input sample, and c_i represents the category center of the i-th sample.
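A sketch of claim 7's cross-media branch, assuming a center-loss-style squared distance between each pooled sample feature and a per-category center shared across media; the exact formula is inferred from the "category center" wording rather than quoted from the source.

```python
import torch
import torch.nn as nn

class CrossMediaCenterLoss(nn.Module):
    """Pulls samples of every medium toward a learnable center of their category."""

    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feat_map, labels):
        # feat_map: (b, c, h, w); global average pooling -> one c-dim vector per sample.
        x = feat_map.mean(dim=(2, 3))
        c = self.centers[labels]                 # category center of each sample
        return ((x - c) ** 2).sum(dim=1).mean()  # squared distance to the shared center
```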
8. The cross-media retrieval method based on the channel fine-grained semantic features as claimed in claim 1, wherein the performance of the cross-media retrieval method based on the channel fine-grained semantic features is tested experimentally using the mAP score, calculated as follows:
the numbers of retrieval results TP, FN, FP and TN are counted with a confusion matrix, wherein TP denotes that the data label is a positive sample and the retrieval result is a positive sample; FN denotes that the data label is a positive sample and the retrieval result is a negative sample; FP denotes that the data label is a negative sample and the retrieval result is a positive sample; and TN denotes that the data label is a negative sample and the retrieval result is a negative sample;
first, the precision P is calculated from the confusion matrix by the formula:
P = \frac{TP}{TP + FP}
the degree of missed detection of the model is measured by the recall of cross-media retrieval, and the recall R is calculated from the confusion matrix by the formula:
R = \frac{TP}{TP + FN}
the average retrieval precision AP can then be calculated by the formula:
AP = \frac{1}{N} \sum_{k=1}^{N} P(R_k) \cdot rel(k)
wherein P(R) denotes the precision P as a function of the recall R, N is the total number of retrieved features, and rel takes the value 0 or 1: rel(k) = 1 when the category of the input feature is the same as that of the k-th retrieval result, and rel(k) = 0 when the category of the input feature differs from that of the retrieval result;
finally, the mean of the AP values is calculated as the mAP, which computes the average retrieval precision of each class and then averages over all of them, by the formula:
mAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q
wherein Q is the number of searches (queries).
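A sketch of the mAP evaluation in claim 8. The names `query_labels` and `ranked_label_lists` are illustrative, and AP is normalized by the length N of the returned list, following the claim's wording.

```python
import numpy as np

def average_precision(query_label, ranked_labels):
    """AP for one query: precision at each rank, kept where the category matches."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)  # rel(k) in {0, 1}
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)     # P as the list grows
    return float((precision_at_k * rel).sum() / len(rel))           # divide by N, per the claim

def mean_average_precision(query_labels, ranked_label_lists):
    """mAP: mean of the per-query AP values over Q queries."""
    aps = [average_precision(q, r) for q, r in zip(query_labels, ranked_label_lists)]
    return float(np.mean(aps))

# usage: mean_average_precision([0, 1], [[0, 2, 0], [1, 1, 2]])
```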
CN202211417363.6A 2022-11-14 2022-11-14 Cross-media retrieval method based on channel fine-grained semantic features Pending CN115630178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211417363.6A CN115630178A (en) 2022-11-14 2022-11-14 Cross-media retrieval method based on channel fine-grained semantic features


Publications (1)

Publication Number Publication Date
CN115630178A true CN115630178A (en) 2023-01-20

Family

ID=84910101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211417363.6A Pending CN115630178A (en) 2022-11-14 2022-11-14 Cross-media retrieval method based on channel fine-grained semantic features

Country Status (1)

Country Link
CN (1) CN115630178A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651660A (en) * 2020-05-28 2020-09-11 拾音智能科技有限公司 Method for cross-media retrieval of difficult samples
US20210349954A1 (en) * 2020-04-14 2021-11-11 Naver Corporation System and method for performing cross-modal information retrieval using a neural network using learned rank images
CN113779283A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN113779284A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Method for constructing entity-level public feature space based on fine-grained cross-media retrieval
CN113792167A (en) * 2021-11-11 2021-12-14 南京码极客科技有限公司 Cross-media cross-retrieval method based on attention mechanism and modal dependence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230120