CN115630178A - Cross-media retrieval method based on channel fine-grained semantic features - Google Patents
- Publication number
- CN115630178A (application number CN202211417363.6A)
- Authority
- CN
- China
- Prior art keywords
- media
- fine
- grained
- cross
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/45—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a cross-media retrieval method based on channel fine-grained semantic features, comprising the following steps: S1, generating a feature map with rich channel information through a deep network; S2, dividing the feature map by channel and feeding the divided parts into a fine-grained learning layer and a cross-media learning layer; S3, adding the results of the fine-grained loss function and the cross-media loss function to obtain the cross-media joint loss. The method groups channels to represent local key regions, generates distinct local key regions with a global loss, learns the fine-grained semantic features of each local key region with a local loss, and measures the correlation among different media data with a cross-media loss. Compared with traditional cross-media retrieval methods, the method learns the fine-grained semantic features of different media simultaneously for cross-media retrieval and avoids the high training cost of designing a dedicated network for each type of media data.
Description
Technical Field
The invention relates to the technical field of cross-media retrieval, in particular to a cross-media retrieval method based on channel fine-grained semantic features.
Background
In the past few years, unsupervised fine-grained feature extraction methods have been widely studied. They aim to extract discriminative local key regions from a feature map and then train end-to-end by learning both the relationships between local key regions within the same input and the differences between local key regions across different inputs. During the training phase the model is typically divided into two sub-networks: the first generates the local key regions, and the second learns fine-grained semantic features between the key regions.
Although fine-grained feature extraction network structures based on local key regions require only image-level labels, their training resembles supervised learning and entails high model complexity and training difficulty. Consequently, using these methods to extract the fine-grained features of different media and to learn cross-media correlation would incur impractical training time and model complexity.
With the development of CNNs, researchers can obtain compact intra-class and well-separated inter-class features merely by designing a task-specific loss function, exploiting the characteristic of fine-grained datasets that intra-class variance is large while inter-class variance is small. For example, 'A Discriminative Feature Learning Approach for Deep Face Recognition' (European Conference on Computer Vision, 2016) proposes the center loss: a center point is maintained for each category and its position is updated at every iteration, and the distance of each feature from its category center is penalized, which effectively pulls features of the same category together. Although such methods need no complex network structure and obtain fine-grained-level discriminative information by optimizing the loss function alone, they are very sensitive to noisy training data because no local key regions of the target are extracted. Since data of the same category in a fine-grained cross-media dataset spans four different media (images, videos, audio, and text), directly learning fine-grained semantic features without extracting local key regions of the media data leaves the model vulnerable to the noise of the different media, causing slow convergence or even divergence.
In contrast, 'The Devil is in the Channels' (IEEE Transactions on Image Processing, 2020) studies the correlation between fine-grained local regions on the channels of the feature map: the channels are divided uniformly into groups, and each group of channel features represents one class, enabling fine-grained classification of images.
Disclosure of Invention
Inspired by this research, the invention provides a cross-media retrieval method based on channel fine-grained semantic features, CFSFL (Channel Fine-grained Semantic Feature Learning), which generates local key regions to learn fine-grained semantic representations and the cross-media correlation of different media features.
In order to achieve the purpose, the invention provides the following technical scheme: a cross-media retrieval method based on channel fine-grained semantic features comprises the following steps:
s1, firstly, generating a characteristic diagram with rich channel information through a deep network;
s2, dividing according to channels and inputting the divided parts into a fine-grained learning layer and a cross-media learning layer;
s21, inputting four media data, namely image data, video data, audio data and text data, into a fine-grained learning layer respectively to learn fine-grained distinguishing characteristics, and outputting the fine-grained distinguishing characteristics as fine-grained loss;
s22, jointly inputting the four media data in the S21 to a cross-media learning layer to learn cross-media correlation, and outputting cross-media loss;
and S3, adding the results of the fine-grained loss function and the cross-media loss function to obtain the cross-media joint loss.
Further, in S2, in the fine-grained cross-media retrieval task the input data comprises four media: image, audio, video, and text. Training adopts a multimedia mixed-input method, sampling the different media data equally for joint input; the network input is D = {(x_i^I, x_i^V, x_i^A, x_i^T, y_i)}, where x_i^I, x_i^V, x_i^A, x_i^T denote image, video, audio, and text samples and y_i their labels. A single shared network extracts the high-dimensional-channel feature map of every media type; the output is F ∈ R^(c×h×w), where c is the number of channels, h the height of the feature map, and w its width.
Furthermore, in the output features of the feature extractor, the four media data are divided by channel, each group of channels representing a different feature region with fine-grained discriminability. The channels of the four media data are divided equally into n groups, each group with feature size (c/n)×h×w. For each group, the feature vectors of a randomly chosen subset of its c/n channel feature maps are set to zero, and the spatial (front-to-back) order of every group's channel feature maps is shuffled. The grouped feature maps are measured by the cross-media joint loss, which consists of a fine-grained loss and a cross-media loss and is defined as follows:

L = Σ_{m∈{I,V,A,T}} L_fg^m + λ·L_cm

where L_fg^m is the fine-grained loss of each media data (image data, video data, audio data, text data), L_cm is the cross-media loss, and λ is a hyper-parameter controlling the degree of influence of the cross-media loss.
Further, in S21, the fine-grained semantic features of the four media data (image, video, audio, text) are extracted and used for loss measurement by learning the global and local relations between fine-grained local key regions; the fine-grained loss of each media is defined as follows:

L_fg = L_local + m·L_global

where L_local denotes the fine-grained local loss, L_global the fine-grained global loss, and m the weight of the global loss term.
Further, channel average pooling and channel maximum pooling are applied to all feature maps within each group of channels. The channel average pooling layer adds the c/n feature maps of each group position-wise and divides by c/n, outputting a 1×h×w map per group; the channel maximum pooling layer takes the position-wise maximum over the c/n feature maps of each group, also outputting a 1×h×w map per group. Applying both poolings to all groups yields the feature representations of all local key regions; the two outputs are then added position-wise, giving a total output feature map of size n×h×w. This map is fed into a global average pooling layer to extract the semantic representation of each local key region, with output feature size n×1: the global average pooling layer adds all feature points of each 1×h×w map and divides by h·w to obtain the semantic feature of the map. Local losses are computed separately for the n local key region features; the fine-grained local loss is defined as follows:

L_local = −Σ_i y_i·log(p_i)

where the loss is computed per media (image, video, audio, text), y_i is the label, and p_i is the probabilistic feature.
Further, the global representation of the feature map is learned through the global loss. The c×h×w feature map, after channel grouping, is first passed through a softmax over the spatial positions to compute the probability of all feature maps, expressing the output as the weight of the feature points in each feature map; the output size is c×h×w. To obtain the most representative feature map of each local feature, a channel maximum pooling layer fits the c/n feature maps of each group into one feature map by taking the position-wise maximum over all maps in the same group, outputting a 1×h×w map per group. Channel-max-pooling all local key regions yields the n most representative feature maps. Finally, the correlation between these n regions is computed by the global loss, defined as follows:

L_global = n − Σ_{x∈h×w} max_{i=1,…,n} s_i(x)

where s_i is the spatially softmax-normalized, channel-max-pooled map of the i-th group, computed per media (image, video, audio, text); n is the number of local regions, h the height of the feature map, w its width, and x each feature point on the feature map.
Further, in S22, for an input feature map of size c×h×w, the feature representation of the media data on each channel is first extracted through a global average pooling layer, giving an output of size c×1. The difference between the media data is then measured by the cross-media loss, defined as follows:

L_cm = (1/2)·Σ_{i=1}^{N} ||f_i − c_{y_i}||_2^2

where the features f_i come from images, video, audio, and text alike, f_i denotes the i-th input sample, and c_{y_i} denotes the category center of the i-th sample.
Furthermore, the experiments examine the performance of the cross-media retrieval method based on channel fine-grained semantic features using the mAP score; the specific calculation is as follows:
counting the number of search results TP, FN, FP and TN by using a confusion matrix, wherein TP represents that a data label is a positive sample, and the search result is the positive sample; FN represents that the data label is a positive sample, the retrieval result is a negative sample, FP represents that the data label is a negative sample, the retrieval result is a positive sample, TN represents that the data label is a negative sample, and the retrieval result is a negative sample;
Firstly, the precision P is calculated from the confusion matrix as follows:

P = TP / (TP + FP)
The degree of missed detection of the model is measured by the recall of cross-media retrieval; the recall R is calculated from the confusion matrix as follows:

R = TP / (TP + FN)
The average retrieval precision AP can then be calculated as follows:

AP = (1/M)·Σ_{k=1}^{N} P(k)·rel(k)

where P(k) is the precision, regarded as a function of the recall R, at rank k; N is the total number of retrieved features; M is the number of relevant results; and rel takes the value 0 or 1: rel = 1 when the category of the input feature is the same as that of the retrieval result, and rel = 0 when the categories differ.
Finally, the mean of the APs is taken as the mAP: the average retrieval precision of each class is computed, and then the mean over all queries is taken:

mAP = (1/Q)·Σ_{q=1}^{Q} AP_q

where Q is the number of queries.
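The retrieval metrics above can be sketched in a few lines of plain Python; this is a minimal illustration of the standard P/R/AP/mAP definitions the text describes, not code from the patent:

```python
def precision(tp, fp):
    # P = TP / (TP + FP): fraction of retrieved results that are relevant
    return tp / (tp + fp)

def recall(tp, fn):
    # R = TP / (TP + FN): fraction of relevant items that were retrieved
    return tp / (tp + fn)

def average_precision(rels):
    """AP for one query. `rels` is the 0/1 relevance of the ranked results;
    AP averages the precision at each rank where a relevant item appears."""
    hits, ap = 0, 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / k
    return ap / hits if hits else 0.0

def mean_average_precision(queries):
    # mAP: mean of the per-query APs over Q queries
    return sum(average_precision(r) for r in queries) / len(queries)
```

For example, a ranking whose 1st and 3rd results are relevant yields AP = (1/1 + 2/3) / 2.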
Compared with the prior art, the invention has the following beneficial effects. Channels are grouped to represent each local key region; the global loss generates distinct local key regions, the local loss then learns the fine-grained semantic features of each local key region, and finally the cross-media loss measures the correlation among different media data. Compared with traditional fine-grained feature learning methods, the method is simpler and more flexible: it automatically generates the required local key regions and avoids the high computational complexity of designing a local-region localization network. Meanwhile, compared with traditional cross-media retrieval methods, it learns the fine-grained semantic features of different media simultaneously for cross-media retrieval and avoids the high training cost of designing a dedicated network for each type of media data. Extensive experiments and ablation studies verify the effectiveness of the method.
Drawings
FIG. 1 is a schematic diagram of a fine-grained cross-media retrieval network structure according to the present invention;
FIG. 2 is a schematic diagram of a local key area of the present invention;
FIG. 3 is a schematic diagram of fine-grained learning according to the present invention;
FIG. 4 is a schematic diagram of a confusion matrix according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 2, fig. 3 and fig. 4, the present invention is a cross-media retrieval method based on channel fine-grained semantic features, which includes the following steps:
s1, firstly, generating a feature map with rich channel information through a deep network;
s2, dividing according to channels and inputting the divided parts into a fine-grained learning layer and a cross-media learning layer;
s21, inputting four media data, namely image data, video data, audio data and text data, into a fine-grained learning layer respectively to learn fine-grained distinguishing characteristics, and outputting the fine-grained distinguishing characteristics as fine-grained loss;
s22, jointly inputting the four media data in the S21 to a cross-media learning layer to learn cross-media correlation, and outputting cross-media loss;
and S3, adding the results of the fine-grained loss function and the cross-media loss function to obtain cross-media joint loss.
In this embodiment, in S2, the input data of the fine-grained cross-media retrieval task comprises four media: image, audio, video, and text. To facilitate learning cross-media correlation, training adopts a multimedia mixed-input method in which the different media data are sampled equally for joint input; the network input is D = {(x_i^I, x_i^V, x_i^A, x_i^T, y_i)}, where x_i^I, x_i^V, x_i^A, x_i^T denote image, video, audio, and text samples and y_i their labels. To reduce the number of model parameters, the application adopts a single ResNet-50 network (excluding the final average pooling layer and fully connected layer) to extract the high-dimensional-channel feature maps of the four media types; since this network contains no linear layer, its output retains rich media-specific semantic information. The output is F ∈ R^(c×h×w), where c is the number of channels, h the height of the feature map, and w its width.
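The equal-sampling mixed-input batch described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions (the function name, pool layout, and per-media count are hypothetical, not from the patent):

```python
import random

def make_mixed_batch(pools, per_media, rng=None):
    """Draw the same number of samples from each media pool (e.g. image,
    video, audio, text) so every mini-batch covers all media equally.
    `pools` maps a media key to a list of (sample, label) pairs."""
    rng = rng or random.Random()
    batch = []
    for media, pool in pools.items():
        for sample, label in rng.sample(pool, per_media):
            batch.append((media, sample, label))
    rng.shuffle(batch)  # interleave the media within the batch
    return batch
```

Each batch element then carries its media tag, so the shared backbone can process all four media while the losses are still computed per media type.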
Then the four different media data are divided by channel in the output features of the feature extractor, each group of channels representing a different feature region with fine-grained discriminability: for image data a local key region, for text data a keyword vector in the text, for video a local key region of the current frame, and for audio a discriminative feature of the spectrogram. This channel-division scheme avoids the computational complexity of designing a local-region feature extraction network.
As shown in fig. 2, different channels of the features correspond to local regions of different targets. The channels of the four media data are divided equally into n groups, each group with feature size (c/n)×h×w; channels left over after grouping are zeroed. To reduce the amount of computation and increase the generalization ability of the model, a dropout operation is applied to the grouped features.
In the experiments, for each group the feature vectors of a randomly chosen subset of its c/n channel feature maps are set to zero, and the spatial (front-to-back) order of every group's channel feature maps is shuffled. Unlike approaches that compute weights for all channels in each group, the method of the invention randomly discards some channels during training, which forces all feature maps in the same group to attempt to learn the local key region information; the loss is then measured by fitting all feature maps within the same group of channels, so that sufficient local key features are obtained.
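The grouping and channel-dropout step above can be sketched in plain Python on a c×h×w feature map stored as nested lists. A minimal sketch under stated assumptions: the function name is hypothetical, and leftover channels beyond n_groups·(c/n) are simply dropped here rather than carried as zeroed features:

```python
import random

def group_and_drop(feature_map, n_groups, drop_per_group, rng=None):
    """Split a c*h*w feature map (list of c h-by-w channel maps) into
    n_groups channel groups, zero `drop_per_group` randomly chosen channels
    inside each group, and shuffle the channel order within each group
    (a dropout-like regularizer, as described in the text)."""
    rng = rng or random.Random()
    size = len(feature_map) // n_groups  # channels per group, c/n
    groups = []
    for g in range(n_groups):
        # deep-copy this group's channels so the input is not mutated
        chans = [[row[:] for row in feature_map[g * size + j]] for j in range(size)]
        for j in rng.sample(range(size), drop_per_group):
            chans[j] = [[0.0] * len(r) for r in chans[j]]  # zero a channel
        rng.shuffle(chans)  # scramble the channel order within the group
        groups.append(chans)
    return groups
```

In a real pipeline this would be a masked tensor operation on the backbone output, but the control flow is the same.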
Finally, the grouped feature maps are measured by the cross-media joint loss proposed by the invention, which consists of a fine-grained loss and a cross-media loss. The fine-grained loss learns the distinctions between the local key features to produce fine-grained discriminative information, and the associations between these local key features can be learned to generate the local key regions. The cross-media loss learns cross-media correlation by measuring the differences between global features. The cross-media joint loss is defined as follows:

L = Σ_{m∈{I,V,A,T}} L_fg^m + λ·L_cm

where L_fg^m is the fine-grained loss of each media data (image data, video data, audio data, text data), L_cm is the cross-media loss, and λ is a hyper-parameter controlling the degree of influence of the cross-media loss.
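The joint-loss combination is a simple weighted sum; a one-function sketch (names hypothetical) makes the structure explicit:

```python
def joint_loss(fine_grained_losses, cross_media_loss, lam):
    """Cross-media joint loss: sum of the per-media fine-grained losses
    (image, video, audio, text) plus a lambda-weighted cross-media loss."""
    return sum(fine_grained_losses.values()) + lam * cross_media_loss
```

For example, per-media fine-grained losses {I: 1.0, V: 2.0, A: 0.5, T: 0.5} with a cross-media loss of 2.0 and λ = 0.5 give a joint loss of 5.0.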
In this embodiment, as shown in fig. 3, the main flow of the proposed fine-grained learning is as follows: the method extracts the fine-grained semantic features of the four media data (images, videos, audio, text) and uses them for loss measurement by learning the global and local relations between fine-grained local key regions. The total loss is defined as follows:

L_fg = L_local + m·L_global

where L_local denotes the fine-grained local loss, L_global the fine-grained global loss, and m the weight of the global loss term.
The method provided by the application divides the whole feature map into n groups to extract the fine-grained semantic representations of n local key regions, the feature maps within each group of channels representing one local key region. Taking a feature map of size c×h×w as an example, after grouping by channel, each (c/n)×h×w channel feature block represents one local key region, where c is the number of channels, h the height of the feature map, w its width, and n the number of local key regions to be learned (usually n = 200, the number of fine-grained categories).
To extract the fine-grained semantic features within each local key region effectively, channel average pooling and channel maximum pooling are applied to all feature maps in each group of channels. The channel average pooling layer adds the c/n feature maps of each group position-wise and divides by c/n, averaging all feature map information within a group of channels; the output per group is 1×h×w. The channel maximum pooling layer takes the position-wise maximum over the c/n feature maps of each group to obtain the peak information of the feature maps; the output per group is likewise 1×h×w. Applying both poolings to all groups yields the feature representations of all local key regions, and the two outputs are added position-wise, giving a total output feature map of size n×h×w. This map is fed into a global average pooling layer, which adds all feature points of each 1×h×w map and divides by h·w to extract the semantic representation of each local key region; the output feature size is n×1. Local losses are computed separately for the n local key region features; the fine-grained local loss is defined as follows:

L_local = −Σ_i y_i·log(p_i)

where the loss is computed per media (image, video, audio, text), y_i is the label, and p_i is the probabilistic feature. Taking image media as an example, y_i is the fine-grained label of the local region of the i-th image, and log(p_i) is the log-likelihood probability of the i-th image's local region after the global average pooling layer. By constraining the local key region features of each 1×h×w map, this loss function helps the model extract more discriminative fine-grained semantic features within each region.
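The local branch (channel average pooling + channel max pooling, position-wise fusion, global average pooling, then cross-entropy over the n regions) can be sketched in plain Python. A minimal illustration of the described computation on nested lists; function names are hypothetical:

```python
import math

def channel_avg_pool(group):  # (c/n) maps of h*w -> one h*w map
    h, w = len(group[0]), len(group[0][0])
    return [[sum(ch[i][j] for ch in group) / len(group) for j in range(w)]
            for i in range(h)]

def channel_max_pool(group):  # (c/n) maps of h*w -> one h*w map
    h, w = len(group[0]), len(group[0][0])
    return [[max(ch[i][j] for ch in group) for j in range(w)]
            for i in range(h)]

def global_avg_pool(fmap):  # h*w map -> one scalar semantic response
    return sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))

def local_loss(groups, label):
    """Per group: add CAP and CMP position-wise, GAP the result to one
    score per local key region, softmax over the n regions, then
    cross-entropy against the region label."""
    scores = []
    for g in groups:
        avg, mx = channel_avg_pool(g), channel_max_pool(g)
        fused = [[a + m for a, m in zip(ra, rm)] for ra, rm in zip(avg, mx)]
        scores.append(global_avg_pool(fused))
    z = max(scores)                      # stabilized softmax
    exps = [math.exp(s - z) for s in scores]
    total = sum(exps)
    return -math.log(exps[label] / total)
```

A group whose channels respond strongly produces a low loss for its own region label and a high loss for any other label.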
Meanwhile, to encourage the model to discover different local regions rather than letting all channels focus on only the single most critical one, the global representation of the feature map is learned through the global loss. The c×h×w feature map, after channel grouping, is first passed through a softmax function over the spatial positions to compute the probability of all feature maps, expressing the output as the weight of the feature points in each feature map; the output size is c×h×w. To obtain the most representative feature map of each local feature, the application fits the c/n feature maps of each group into one feature map through a channel maximum pooling layer, which takes the position-wise maximum over all maps in the same group; the output per group is 1×h×w. Channel-max-pooling all local key regions yields the n most representative feature maps. Finally, the correlation between these n regions is computed by the global loss, defined as follows:

L_global = n − Σ_{x∈h×w} max_{i=1,…,n} s_i(x)

where s_i is the spatially softmax-normalized, channel-max-pooled map of the i-th group, computed per media (image, video, audio, text); n is the number of local regions, h the height of the feature map, w its width, and x each feature point on the feature map. Taking image media as an example, s_i(x) is the weight of the feature point at position x of the i-th region of the image feature. The loss is smallest when the n channel groups attend to different positions; optimizing it helps the model discover distinct local regions and avoids the information redundancy that would arise if all channel features learned only one local region.
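The global (diversity) loss can be sketched as follows. This is a reconstruction of the described computation, spatial softmax of each group's channel-max-pooled map, position-wise max across the n groups, subtracted from n, and the exact normalization is an assumption, not taken verbatim from the patent:

```python
import math

def global_loss(groups):
    """Diversity-style global loss sketch: small when the n local key
    regions peak at different spatial positions, larger when they all
    attend to the same position."""
    dists = []
    for g in groups:
        # channel max pooling, flattened over spatial positions
        mx = [max(ch[i][j] for ch in g)
              for i in range(len(g[0])) for j in range(len(g[0][0]))]
        z = max(mx)                          # stabilized spatial softmax
        e = [math.exp(v - z) for v in mx]
        s = sum(e)
        dists.append([v / s for v in e])
    # position-wise max across the n region distributions
    covered = sum(max(d[x] for d in dists) for x in range(len(dists[0])))
    return len(groups) - covered
```

Two groups peaking at different positions cover nearly 2 units of probability mass, so the loss approaches 0; identical groups cover only 1 unit, so the loss approaches n − 1.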
In this embodiment, in addition to learning fine-grained semantic features within each medium using the fine-grained loss, the cross-media correlation between different media needs to be measured. Since data of different media types are distributed in the feature space by media type, a loss function is needed to cluster the media data by fine-grained subcategory. The present invention uses center loss as the measure of cross-media correlation, because it can set a category center for each fine-grained subcategory and then reduce the impact of the "media gap" by narrowing the distance between data of different media types and this center.
For an input feature map of the media data, the feature representation on each channel is first extracted through a global average pooling layer. These media features are then mapped into a high-level semantic space by a linear layer that outputs a 200-dimensional feature representation (200 being the number of fine-grained subcategories), since the cross-media loss partitions the data by fine-grained subcategory. Finally, the difference between these media data is measured by the cross-media loss, defined as follows:
wherein the media symbols respectively represent images, video, audio and text; one term denotes the i-th input sample and the other the category center of the i-th sample. By optimizing the cross-media loss, the distance between different media data and the center of their fine-grained subcategory can be effectively reduced.
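The center loss used here can be sketched as follows (the standard center-loss form; the patent's exact scaling is not reproduced, so the 0.5 factor and names are assumptions):

```python
import numpy as np

def center_loss(features, labels, centers):
    """Sketch of center loss for cross-media correlation: each sample,
    regardless of its media type, is pulled toward the learned center
    of its fine-grained subcategory."""
    # features: (batch, dim); labels: (batch,); centers: (num_subcategories, dim)
    diffs = features - centers[labels]
    return 0.5 * (diffs ** 2).sum(axis=1).mean()
```

Because image, video, audio and text samples of the same subcategory share one center, minimizing this quantity narrows the "media gap" described above.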
In this embodiment, the experiment uses the mAP score to test the performance of the cross-media retrieval method of channel fine-grained semantic features, first calculating the Average Precision (AP) of each media query, and then taking the average of them as the mAP score. For the two different data sets, the invention evaluates four multi-modal fine-grained cross-media retrieval performances and twelve bi-modal fine-grained cross-media retrieval performances respectively.
The mAP is calculated as follows:
as shown in FIG. 4, the present invention uses a confusion matrix to count the number of search results TP, FN, FP, and TN, wherein each column represents the positive or negative of the prediction result, the total number of each column represents the positive or negative of the prediction, each row represents the positive or negative of the true label of the data, and the total number of each row represents the number of positive or negative labels. TP (True Positive) indicates that the data label is a Positive sample and the search result is a Positive sample. FN (False Negative) indicates that the data label is a positive sample and the search result is a Negative sample. FP (False Positive) indicates that the data label is a negative sample and the search result is a Positive sample. TN (True Negative) indicates that the data label is a Negative sample, and the search result is a Negative sample.
First, the precision P is calculated from the confusion matrix according to the following formula:
The degree of missed detection of the model is measured by the recall rate of cross-media retrieval; the recall rate R is calculated from the confusion matrix according to the following formula:
The average retrieval precision AP can then be calculated according to the following formula:
wherein P(R) represents the precision P as a function of the recall rate R, N is the total number of retrieved features, and rel takes the value 0 or 1: rel = 1 when the category of the input feature is the same as that of the search result, and rel = 0 when the categories differ;
Finally, the mean of the AP values is taken as the mAP: the average retrieval precision of each class is computed and then averaged over all classes, according to the following formula:
wherein Q is the number of retrievals; the mAP evaluation index comprehensively considers both the precision and the recall of the model, making it well suited to evaluating cross-media retrieval performance.
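The AP/mAP computation described above can be illustrated with a short sketch (standard definitions; the patent's exact normalization over N may differ slightly):

```python
def average_precision(rel):
    """rel[k] is 1 if the k-th retrieved item shares the query's
    category, else 0 (the `rel` indicator from the text). Precision
    P = TP / (TP + FP) is evaluated at each relevant rank and averaged."""
    hits, precisions = 0, []
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rel_lists):
    """mAP: the mean of AP over Q queries."""
    return sum(average_precision(r) for r in rel_lists) / len(rel_lists)
```

For example, a ranked result list that is relevant at ranks 1 and 3 gives AP = (1/1 + 2/3) / 2.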
Because the invention adopts a unified network as the feature extractor for four different media, the data of the different media need to be converted to the same network input size; the feature extractor used in the experiment is the Resnet50 network structure. For a fair comparison, the four media data (image, text, audio and video) are preprocessed and converted into feature-map inputs of the same size. In order to accelerate the training process, the experiment initializes the Resnet50 network with parameters pre-trained on ImageNet, while the fine-grained learning network and the cross-media learning process provided by the invention are trained from random parameters at the first iteration. In the testing phase, the mAP score is calculated from the features generated by the linear layer in the cross-media learning process.
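The idea of forcing all four media into one common input size can be sketched as follows (nearest-neighbour resampling stands in for the real preprocessing, whose details the text does not give; the 224 default is an assumption):

```python
import numpy as np

def to_network_input(data, size=224):
    """Resize any 2-D media representation (image frame, audio
    spectrogram, text-embedding grid, video frame) to one common
    size so a single Resnet50 backbone can consume all four media."""
    h, w = data.shape
    rows = np.arange(size) * h // size   # nearest source row per output row
    cols = np.arange(size) * w // size   # nearest source column per output column
    return data[rows][:, cols]
```

Whatever the true preprocessing is, its role is the same: after this step every medium presents the backbone with an identically shaped input.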
In order to fully learn cross-media correlation, the batch size of a single medium is set to 8 in the experiment, and the four media are then mixed for joint training (each batch contains 32 training samples: 8 image, 8 audio, 8 text and 8 video samples). Because the training sets of the different media are of unequal size, the text training data finishes one pass before the other three media do. To train the data of different media types fairly, after each round of training the experiment randomly re-samples 4000 samples per medium for the next round (for the PKU-FG-Xmedia data set, the text data set is the smallest, with 4000 training samples and 4000 test samples); this also helps avoid overfitting during model training. The experiment trains for 200 rounds with a cosine learning-rate schedule, a base learning rate of 0.001, momentum of 0.9, and weight decay of 0.0001.
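The mixed-media batching scheme described above can be sketched as follows (a hypothetical illustration: sampling with replacement is my reading of how media smaller than 4000 samples are re-drawn each round):

```python
import random

def mixed_batches(media_data, per_media=8, epoch_size=4000, seed=0):
    """Sketch of the mixed-media training scheme: epoch_size samples
    are re-drawn per medium each round, then interleaved into batches
    containing per_media samples from each of the four media."""
    rng = random.Random(seed)
    pools = {m: [rng.choice(items) for _ in range(epoch_size)]
             for m, items in media_data.items()}
    for start in range(0, epoch_size, per_media):
        batch = []
        for m in media_data:
            batch.extend((m, x) for x in pools[m][start:start + per_media])
        yield batch  # 4 media x per_media samples, e.g. 32 in total
```

Re-drawing the pools every round gives each medium an equal number of gradient updates regardless of its true dataset size.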
In order to fully verify the effectiveness of the proposed method, experiments compare it with several state-of-the-art cross-media retrieval methods in the bimodal and multimodal domains, including GXN, SCAN, CMDN, GSPH, JRL, ACMR, MHTN, MGAH, CDCR and FGCrossNet.
Tables 7.1 and 7.2 show the bimodal-domain fine-grained cross-media retrieval mAP scores. For fine-grained cross-media retrieval in the bimodal domain, the invention performs one-to-one retrieval among the four media of images, videos, audios and texts, covering the twelve directions I→V, I→A, I→T, V→I, V→A, V→T, A→I, A→V, A→T, T→I, T→V and T→A, where for example I→T means the input data are image features and the retrieval results are text features. Table 7.3 shows the fine-grained cross-media retrieval mAP scores for the multimodal domain. For fine-grained cross-media retrieval in the multimodal domain, the invention performs one-to-many retrieval over the four media, covering I→All, V→All, A→All and T→All, where I→All means the input data are image features and the retrieval results contain all media features. Cross-media retrieval in the multimodal domain outputs retrieval results of all media types and then sorts them by similarity.
From the results in Tables 7.1 and 7.2, it can be seen that the CFSFL method proposed by the present invention improves fine-grained cross-media retrieval for every one-to-one media pair compared with conventional methods. Because the channel feature learning method provided by the invention can generate local key regions for different media data and learn the fine-grained semantic features of each local region, its fine-grained cross-media retrieval mAP result in the bimodal domain improves by 6.7 percentage points on average. Five of the retrieval processes improve most obviously (by about 9 percentage points), because the conventional FGCrossNet method adopts only three losses (cross-entropy loss, triplet loss and ranking loss) to learn cross-media correlation and fine-grained semantic features; as a result, the features extracted by that network contain a large amount of redundant information, and not enough key features can be extracted to learn the cross-media correlation among fine-grained semantic features.
TABLE 7.1 mAP results for bimodal fine-grained cross-media retrieval
TABLE 7.2 mAP results for bimodal fine-grained cross-media retrieval
However, the CFSFL method provided by the invention shows only a small improvement in retrieval between image data and video data, because cross-entropy loss already handles visual features well: mapping the features to 200 dimensions forms a good alignment with the fine-grained subcategory labels, which favors learning fine-grained semantic features. As a result, using the local loss for fine-grained learning does not significantly improve the mAP retrieval score between image and video; the two image-video retrieval directions improve by only about 1 and 2.4 percentage points, respectively.
From the results in table 7.3, it can be seen that the CFSFL method proposed by the present invention is not only improved in the bimodal domain, but also has significant results when a one-to-many search process is performed in the multimodal domain. Compared with the most advanced method, the CFSFL method provided by the invention improves the average retrieval mAP score on four multi-modal fine-grained cross-media retrieval by about 6.1 percentage points.
TABLE 7.3 mAP results for multimodal fine-grained cross-media retrieval
In order to fully verify the effectiveness of the method, the invention not only conducts experiments on fine-grained cross-media retrieval but also performs comparative experiments on coarse-grained cross-media retrieval. Tables 7.4 and 7.5 give the coarse-grained cross-media retrieval results on the Xmedia dataset for one-to-one retrieval among image, video, audio and text media. The cross-media retrieval score of the proposed CFSFL method in the bimodal domain is 0.777 on average, 1.8 percentage points higher than the mainstream FGCrossNet method. This shows that the channel-feature-based learning method also brings some improvement on a coarse-grained dataset, because the proposed fine-grained learning method can learn the local and global relations of key points simultaneously, which helps the model generate more effective semantic features. However, in part of the media retrieval processes its performance is not as good as methods specially designed for coarse-grained datasets.
TABLE 7.4 mAP results for bimodal coarse-grained cross-media retrieval
TABLE 7.5 mAP results for bimodal coarse-grained cross-media retrieval
Table 7.6 shows the coarse-grained cross-media retrieval mAP results for the multimodal domain. From the table it can be found that the proposed method improves every one-to-many retrieval process in the multimodal domain. This is because the CFSFL method can learn cross-media correlations from the key points in each medium, which effectively reduces noise effects in different media data. However, since coarse-grained datasets have obvious inter-class differences, and the method of the present invention is specially designed for fine-grained datasets, the local-region generation and the learning of fine-grained semantic representations of local regions have less influence on coarse-grained datasets. Therefore, compared with state-of-the-art cross-media retrieval methods, the method of the present invention yields only a limited improvement in the coarse-grained cross-media retrieval task.
TABLE 7.6 mAP results for multimodal coarse-grained cross-media retrieval
In order to further demonstrate the effectiveness of the method, ablation studies are carried out from three angles: different feature extraction networks, different channel processing strategies, and different channel feature learning methods; extensive experiments on the fine-grained cross-media dataset PKU-FG-Xmedia confirm the effectiveness of the method.
In order to effectively learn fine-grained semantic representations of the various media, a general and effective feature extraction network is needed to extract features of different media data for fine-grained learning and cross-media retrieval. The impact of three different deep networks on the proposed method, namely VGG16, B-CNN and Resnet50, was studied by replacing the network in FIG. 1. Tables 7.7 and 7.8 show the fine-grained cross-media retrieval results of the three networks in the bimodal domain, and Table 7.9 shows their results in the multimodal domain. It can be found that the Resnet50 network structure outperforms the B-CNN network structure, which in turn outperforms the VGG16 network structure.
TABLE 7.7 mAP results for bimodal fine-grained cross-media retrieval of different feature extraction networks
TABLE 7.8 mAP results for bimodal fine-grained cross-media retrieval of different feature extraction networks
Table 7.9 multi-modal fine-grained cross-media retrieval mAP results for different feature extraction networks
In order to fully utilize the feature information in the channels and reduce information redundancy, the proposed CFSFL method divides the channels and then represents different local key regions by different channel groups, so the channel processing method is very important for local-region generation and fine-grained semantic learning. An ablation study is conducted on the proposed channel processing method. As shown in Table 7.10, three different channel processing methods are compared; for a fair comparison, the remaining parameters in the experiment are kept constant. The channel-division method provided by the invention directly divides the output of the last residual block of the Resnet network by channel; because the number of output channels, 2048, is not divisible by the 200 classes, the method randomly zeros 48 channels and then divides the remainder evenly, giving 10 channels per group; within each group, 3 channel features are zeroed at the set ratio and the front and back positions of the feature maps are scrambled. Method 1 applies a linear mapping to the output features of the last residual block of the Resnet network, maps the number of channels to 2000, and then divides the channels evenly. Method 2 differs from the method of the invention in that all grouped features are retained and no channel-dropping operation is taken.
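The channel-division procedure described above can be sketched as follows (one plausible reading: whether the 48 zeroed channels are kept or discarded before grouping is ambiguous in the text, so here they are discarded so that 2000 channels divide evenly into 200 groups of 10; all names are mine):

```python
import numpy as np

def divide_channels(features, n_groups=200, drop_global=48,
                    drop_per_group=3, seed=0):
    """Sketch of the channel processing: 48 of the 2048 Resnet output
    channels are dropped at random, the remaining 2000 are divided
    into 200 groups of 10, then 3 channels per group are zeroed and
    the channel order inside each group is scrambled."""
    rng = np.random.default_rng(seed)
    c = features.shape[0]
    dropped = rng.choice(c, size=drop_global, replace=False)
    keep = np.setdiff1d(np.arange(c), dropped)
    per_group = len(keep) // n_groups
    groups = features[keep].reshape(n_groups, per_group, *features.shape[1:])
    for g in groups:
        # zero a fixed ratio of channels inside the group
        g[rng.choice(per_group, size=drop_per_group, replace=False)] = 0.0
        rng.shuffle(g, axis=0)  # scramble front/back channel positions
    return groups
```

Zeroing a few channels per group plays the role of the channel-dropping ratio that the ablation in Table 7.10 varies.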
TABLE 7.10 different channel treatment methods
Tables 7.11 and 7.12 show the fine-grained cross-media retrieval results of the different channel processing methods in the bimodal domain, and Table 7.13 shows those in the multimodal domain. It can be found that directly dividing the output of the last residual block of the Resnet network by channel is much better than the linear-mapping method, because linearly mapping different media data loses part of each medium's unique information, which hinders the fine-grained loss from learning the media-specific local regions and fine-grained semantic information. Comparing Method 2 with the proposed CFSFL method shows that the channel-dropping ratio set by the invention brings a further improvement in cross-media retrieval. This indicates that a reasonable channel-dropping ratio encourages each channel within a group to fully learn the fine-grained feature representation of a local region.
TABLE 7.11 mAP results of different channel processing methods in bimodal fine-grained cross-media retrieval
TABLE 7.12 mAP results of different channel processing methods in bimodal fine-grained cross-media retrieval
TABLE 7.13 mAP results of different channel processing methods in multimodal fine-grained cross-media retrieval
In addition, the invention conducts an ablation study on the proposed channel feature learning method, using the fine-grained learning loss:
and the cross-media learning loss:
Ablation experiments analyze the effectiveness of these two parts. The fine-grained cross-media retrieval results in the bimodal domain are shown in Tables 7.14 and 7.15, and those in the multimodal domain in Table 7.16. The following formula is removed from the fine-grained learning loss:
To study the influence of the global connections between local key regions on learning fine-grained semantic features within each medium, the loss can be defined as:
To investigate whether fine-grained cross-media retrieval can be achieved simply by learning representations of fine-grained semantic features within each medium and aligning them with their fine-grained subcategory labels, the loss can be defined as:
It can be found that the proposed CFSFL method achieves the best effect by jointly considering the local differences and global connections between local key regions in different media together with the correlation between different media. Comparing the two ablated variants shows that the global associations of local key regions within a medium are more important than the measurement of correlation between different media, because the formula:
in (1), when learning the discriminative features of local key regions in different media, aligns each local-region feature with its own fine-grained subcategory label, which effectively promotes data of different media categories to be distributed by fine-grained subcategory in the feature space.
TABLE 7.14 mAP results of different losses in bimodal fine-grained cross-media retrieval
TABLE 7.15 mAP results for different penalties in bimodal fine-grained cross-media retrieval
Table 7.16 mAP results of different losses in multimodal fine-grained cross-media retrieval
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.
Claims (8)
1. A cross-media retrieval method based on channel fine-grained semantic features is characterized by comprising the following steps:
s1, firstly, generating a feature map with rich channel information through a deep network;
s2, dividing according to channels and inputting the divided data into a fine-grained learning layer and a cross-media learning layer;
s21, respectively inputting four media data, namely image data, video data, audio data and text data, into a fine-grained learning layer to learn fine-grained discrimination characteristics, and outputting the fine-grained discrimination characteristics as fine-grained loss;
s22, jointly inputting the four media data in the S21 to a cross-media learning layer to learn cross-media correlation, and outputting cross-media loss;
s3, adding results of the fine-grained loss function and the cross-media loss function to obtain cross-media joint loss;
and S4, evaluating the performance of the retrieval method by adopting an experiment.
2. The cross-media retrieval method based on channel fine-grained semantic features as claimed in claim 1, wherein in S2, in the fine-grained cross-media retrieval task, the input data contains the four media of images, audio, video and text; training adopts a multimedia mixed-input method, with different media data sampled equally and input jointly, the network input consisting of the image, video, audio and text samples together with their labels; a unified network extracts a high-dimensional-channel feature map of each media type, outputting a feature of size c×h×w, where c is the number of channels, h the length of the feature map, and w the width of the feature map.
3. The method of claim 2, wherein the four different media data are divided by channel in the output features of the feature extractor, each channel representing a different feature region with fine-grained discriminability; the channels of the four media data are equally divided into n groups; within each group, all the feature vectors of a set number of channels are set to zero, and the front and back spatial positions of the feature maps of the channels in every group are scrambled; the grouped feature maps are measured through a cross-media joint loss, the loss function consisting of a fine-grained loss and a cross-media loss; the cross-media joint loss is defined as follows:
4. The cross-media retrieval method based on the channel fine-grained semantic features as claimed in claim 2, wherein in S21, the fine-grained semantic features of four media data, namely, image, video, audio and text, are extracted and used for loss measurement by learning the global and local relations between fine-grained local key regions, and the fine-grained loss of the media data is defined as follows:
5. The method of claim 4, wherein all feature maps in each group of channels are subjected to channel average pooling and channel maximum pooling: the channel average pooling layer sums the feature maps of each group by position and divides by the number of channels in the group, outputting one feature map per group; the channel maximum pooling layer takes the position-wise maximum over the feature maps of each group, likewise outputting one feature map per group; the feature representations of all local key regions are obtained by performing channel average pooling and channel maximum pooling on all groups and adding the two output results by position; the resulting feature map is then input into a global average pooling layer to extract the semantic representation of each local key region, the global average pooling layer summing all feature points in each feature map and dividing by their number to obtain the semantic features of the feature map; fine-grained local losses are respectively calculated for the n local key region features, defined as follows:
6. The method of claim 5, wherein the global representation of the feature map is learned through a global loss: the channel-grouped feature maps are first input into a probability-normalization function that computes a probability over each feature map, re-expressing the output features as weights of the feature points in each feature map; to obtain the most representative feature map of each local feature, a channel maximum pooling layer fits the feature information of each group into one feature map by taking the position-wise maximum of all feature maps in the same group; performing channel maximum pooling on all local key regions yields the n most representative feature maps; finally, the correlation between these n regions is calculated by the fine-grained global loss, defined as follows:
7. The method of claim 2, wherein in S22, the feature representation of the media data on each channel is extracted through a global average pooling layer; the difference between these media data is then measured by the cross-media loss, defined as follows:
8. The cross-media retrieval method based on the channel fine-grained semantic features as claimed in claim 1, characterized in that the performance of the cross-media retrieval method based on the channel fine-grained semantic features is tested experimentally using the mAP score, and the specific calculation method is as follows:
counting the number of search results TP, FN, FP and TN by using a confusion matrix, wherein TP represents that a data label is a positive sample, and the search result is the positive sample; FN represents that the data label is a positive sample, the retrieval result is a negative sample, FP represents that the data label is a negative sample, the retrieval result is a positive sample, TN represents that the data label is a negative sample, and the retrieval result is a negative sample;
firstly, the precision P is calculated according to the confusion matrix, the calculation formula being as follows:
and measuring the missed-detection degree of the model by calculating the recall rate of cross-media retrieval, and calculating the recall rate R according to the confusion matrix, wherein the calculation formula is as follows:
therefore, the average retrieval accuracy AP can be calculated, and the calculation formula is as follows:
wherein P(R) represents the precision P as a function of the recall rate R, N is the total number of retrieved features, and rel takes the value 0 or 1: rel = 1 when the category of the input feature is the same as that of the search result, and rel = 0 when the categories differ;
and finally, the mean of the AP values is calculated as the mAP: the average retrieval precision of each class is computed and then averaged over all classes, according to the following formula:
wherein Q is the number of searches.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211417363.6A CN115630178A (en) | 2022-11-14 | 2022-11-14 | Cross-media retrieval method based on channel fine-grained semantic features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211417363.6A CN115630178A (en) | 2022-11-14 | 2022-11-14 | Cross-media retrieval method based on channel fine-grained semantic features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115630178A true CN115630178A (en) | 2023-01-20 |
Family
ID=84910101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211417363.6A Pending CN115630178A (en) | 2022-11-14 | 2022-11-14 | Cross-media retrieval method based on channel fine-grained semantic features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115630178A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111651660A (en) * | 2020-05-28 | 2020-09-11 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
US20210349954A1 (en) * | 2020-04-14 | 2021-11-11 | Naver Corporation | System and method for performing cross-modal information retrieval using a neural network using learned rank images |
CN113779283A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method with deep supervision and feature fusion |
CN113779284A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Method for constructing entity-level public feature space based on fine-grained cross-media retrieval |
CN113792167A (en) * | 2021-11-11 | 2021-12-14 | 南京码极客科技有限公司 | Cross-media cross-retrieval method based on attention mechanism and modal dependence |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210349954A1 (en) * | 2020-04-14 | 2021-11-11 | Naver Corporation | System and method for performing cross-modal information retrieval using a neural network using learned rank images |
CN111651660A (en) * | 2020-05-28 | 2020-09-11 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
CN113779283A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method with deep supervision and feature fusion |
CN113779284A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Method for constructing entity-level public feature space based on fine-grained cross-media retrieval |
CN113792167A (en) * | 2021-11-11 | 2021-12-14 | 南京码极客科技有限公司 | Cross-media cross-retrieval method based on attention mechanism and modal dependence |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hui et al. | PACRR: A position-aware neural IR model for relevance matching | |
CN107515895B (en) | Visual target retrieval method and system based on target detection | |
CN110851645B (en) | Image retrieval method based on similarity maintenance under deep metric learning | |
Mitra et al. | Learning to match using local and distributed representations of text for web search | |
Gao et al. | Database saliency for fast image retrieval | |
CN106202256B (en) | Web image retrieval method based on semantic propagation and mixed multi-instance learning | |
CN105718532B (en) | A kind of across media sort methods based on more depth network structures | |
US20180018566A1 (en) | Finding k extreme values in constant processing time | |
US20180341686A1 (en) | System and method for data search based on top-to-bottom similarity analysis | |
CN110866134B (en) | Image retrieval-oriented distribution consistency keeping metric learning method | |
CN112115716A (en) | Service discovery method, system and equipment based on multi-dimensional word vector context matching | |
CN107180087B (en) | A kind of searching method and device | |
Ravichandran et al. | Statistical QA-Classifier vs. Re-ranker: What’s the difference? | |
Shi et al. | Semi-supervised acoustic event detection based on tri-training | |
Sanchez et al. | Fast trajectory clustering using hashing methods | |
Ye et al. | Query-adaptive remote sensing image retrieval based on image rank similarity and image-to-query class similarity | |
Hui et al. | A position-aware deep model for relevance matching in information retrieval | |
Sujana et al. | Rumor detection on Twitter using multiloss hierarchical BiLSTM with an attenuation factor | |
CN105760875A (en) | Binary image feature similarity discrimination method based on random forest algorithm | |
CN115309860A (en) | False news detection method based on pseudo twin network | |
CN105701501B (en) | A kind of trademark image recognition methods | |
CN113220915B (en) | Remote sensing image retrieval method and device based on residual attention | |
CN107423319B (en) | Junk web page detection method | |
Zhang et al. | Exploring uni-modal feature learning on entities and relations for remote sensing cross-modal text-image retrieval | |
Bretan et al. | Learning and evaluating musical features with deep autoencoders |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20230120 |