CN111737512B - Silk cultural relic image retrieval method based on depth feature region fusion - Google Patents
Silk cultural relic image retrieval method based on depth feature region fusion
- Publication number
- CN111737512B CN202010498104.5A
- Authority
- CN
- China
- Prior art keywords
- cultural relic
- target
- silk cultural
- retrieval
- silk
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/54—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Library & Information Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a silk cultural relic image retrieval method based on depth feature region fusion, characterized by comprising the following steps: classifying and learning silk cultural relic images through deep-learning global feature extraction; selecting the activation region corresponding to a given category of silk cultural relic image through neural network visualization, thereby localizing the retrieval target; fusing the features related to the target region through regional feature fusion to serve as the local descriptor of the target; and selecting the silk cultural relic image whose feature distance to the user's query picture is smallest as the retrieval result. Because the retrieval target in a silk cultural relic image usually occupies only a small portion of the image, the invention combines depth feature extraction with candidate retrieval regions to accurately localize the retrieval target and extract its fine-grained features, thereby improving silk cultural relic image retrieval performance and achieving small-target retrieval of silk cultural relic images.
Description
Technical Field
The invention relates to a retrieval method for silk cultural relic images, in particular to a retrieval method based on depth feature extraction and fine-grained region fusion, and belongs to the field of information technology.
Background
As a widely used medium, silk cultural relic image information resources have witnessed rapid development and dissemination. A silk cultural relic retrieval method based on depth feature extraction can effectively manage the rapidly growing silk cultural relic image datasets and present traditional silk cultural relics to a large number of users in digital form over the network.
Current silk cultural relic retrieval methods based on depth feature extraction mainly rely on global features: the output of the fully connected layer of a deep network is used as the feature descriptor, preserving the overall semantic information of the image. Such global methods mostly target image-classification-style retrieval tasks, and their feature extraction likewise relies on the global fully-connected-layer output. However, because a convolutional neural network mainly encodes global spatial information, the resulting features lack invariance to geometric transformations such as scale, rotation and translation and to changes in spatial layout, which limits their robustness for highly variable image retrieval. Moreover, in silk images the retrieval target occupies only a small portion of the whole image, so for this small-target retrieval problem global features can neither effectively represent the small target nor accurately localize its region.
Disclosure of Invention
The technical problem to be solved by the invention is that existing silk cultural relic retrieval methods cannot achieve small-target retrieval and localization.
In order to solve this technical problem, the invention provides a silk cultural relic image retrieval method based on depth feature region fusion, characterized by comprising the following steps:
step 1, classifying and learning silk cultural relic images by adopting a deep learning global feature extraction mode, and classifying all silk cultural relic images into different categories;
step 2, selecting an activation area corresponding to the silk cultural relic image of a certain category determined in the step 1 by adopting a neural network visualization mode, and further realizing retrieval target positioning, wherein the method comprises the following steps:
step 201, fusing the feature maps of the silk cultural relic images of the specific category determined in step 1 by using the Grad-CAM method to obtain a Grad-CAM map;
step 202, performing global average pooling on the Grad-CAM map of each category, namely taking the mean of the Grad-CAM map as its score, and retaining the Grad-CAM maps whose score exceeds a threshold, which indicates that they contain a target of the current category;
step 203, locating the specific position of the target of the corresponding category from the contour of the retained Grad-CAM map, thereby achieving target localization;
and 3, fusing the features related to the target region in a region feature fusion mode to be used as the local descriptor of the target, wherein the method comprises the following steps:
step 301, locating the detected target to obtain the convolution result within its localized region, an H × W × D tensor feature map, where H, W and D denote the height, width and number of channels of the feature map respectively;
step 302, adopting the Regional Maximum Activation of Convolutions (R-MAC) strategy: regarding the H × W × D tensor feature map as D descriptors of dimension H × W, and performing local average pooling or maximum pooling on the D descriptors to obtain a D-dimensional feature representing the target;
and step 4, obtaining the user's query picture, extracting its features with the methods of steps 2 and 3, calculating the Euclidean distance in the local feature space between the query features and the features of each category of silk cultural relic image, and selecting the category of silk cultural relic image whose features are closest to the query as the retrieval result.
Preferably, in step 1, during classification learning, the pre-trained model is fine-tuned for classification on the target data by means of transfer learning.
Preferably, in step 302, if one picture contains multiple targets, the D-dimensional features of the different targets are concatenated as the output through regional feature fusion.
Because the retrieval target in a silk cultural relic image usually occupies only a small portion of the image, the invention combines depth feature extraction with candidate retrieval regions to accurately localize the retrieval target and extract its fine-grained features, thereby improving silk cultural relic image retrieval performance and achieving small-target retrieval of silk cultural relic images.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The invention is improved on the basis of the existing method for extracting the global features based on deep learning so as to realize the small target retrieval and positioning of the silk cultural relic image.
The invention provides a silk cultural relic image retrieval method based on depth feature region fusion, which comprises the following steps:
step 1, classifying and learning the silk cultural relic images through deep-learning global feature extraction, thereby retaining the global classification information of the features. During classification learning, a pre-trained model (such as VGGNet or ResNet) is fine-tuned for classification on the target data by means of transfer learning, so that the feature maps of the fine-tuned CNN contain classification information. This classification information amounts only to coarse-grained learning; the subsequent fine-grained learning requires further fine-tuning of the network.
And 2, selecting an activation area corresponding to the silk cultural relic image of a certain category determined in the step 1 by adopting a neural network visualization mode, and further realizing retrieval target positioning.
The step 2 comprises the following steps:
Step 201, fuse the feature maps of the silk cultural relic images of the specific category determined in step 1 using the Grad-CAM (Gradient-weighted Class Activation Mapping) method to obtain a Grad-CAM map, so as to visualize the target region. The core idea is a weighted fusion of the feature maps of a chosen convolutional layer to visualize the objects of a specific class.
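As an illustration, the weighted fusion at the heart of Grad-CAM can be sketched in a few lines of numpy. This is a minimal sketch, not the patent's implementation: the function name, array shapes and random inputs are placeholders.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Weighted fusion of a conv layer's feature maps into a Grad-CAM heat map.

    feature_maps: (D, H, W) activations of the chosen convolutional layer.
    gradients:    (D, H, W) gradients of the class score w.r.t. those activations.
    """
    # Channel weights: global average of each channel's gradient.
    weights = gradients.mean(axis=(1, 2))                     # shape (D,)
    # Weighted sum over channels, then ReLU to keep class-positive regions.
    cam = np.maximum((weights[:, None, None] * feature_maps).sum(axis=0), 0.0)
    return cam

# Toy inputs standing in for a real network's activations and gradients.
rng = np.random.default_rng(0)
fmap = rng.random((8, 7, 7))
grad = rng.random((8, 7, 7))
cam = grad_cam(fmap, grad)
print(cam.shape)  # (7, 7)
```

In a real pipeline the activations and gradients would come from a backward pass through the fine-tuned CNN; here they are random arrays so the fusion step itself can be seen in isolation.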
Step 202, perform Global Average Pooling on the Grad-CAM map of each category, namely take the mean of the Grad-CAM map as its score (a vote); Grad-CAM maps whose score exceeds a threshold are retained, indicating that they contain a target of the current category.
Step 203, locate the specific position of the target of the corresponding category from the contour of the retained Grad-CAM map, achieving retrieval-target localization.
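Steps 202 and 203 amount to scoring each Grad-CAM map by its global mean and then reading a bounding box off the activated region. A minimal numpy sketch follows; the threshold values are illustrative assumptions, not values specified by the patent.

```python
import numpy as np

def locate_target(cam, score_threshold=0.1, mask_level=0.5):
    """Score a Grad-CAM map by its global mean (step 202) and, if the score
    passes the threshold, return the bounding box of the activated region
    (step 203). Returns None when the map does not contain the target."""
    if cam.max() > 0:
        cam = cam / cam.max()                      # normalise to [0, 1]
    score = cam.mean()                             # global average pooling
    if score < score_threshold:
        return None                                # no target of this category
    ys, xs = np.nonzero(cam >= mask_level)         # activated pixels
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

cam = np.zeros((8, 8))
cam[2:5, 3:6] = 1.0                                # a small activated patch
print(locate_target(cam))  # (3, 2, 5, 4)
```

The patent speaks of the contour of the retained Grad-CAM map; this sketch approximates that with a simple threshold mask plus axis-aligned bounding box, which is one common way to realize the step.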
And 3, fusing the features related to the target region in a region feature fusion mode to be used as the local descriptor of the target.
The step 3 comprises the following steps:
Step 301, locate the detected target and take the convolution result within its localized region, which is an H × W × D tensor feature map, where H, W and D denote the height, width and number of channels of the feature map respectively.
Step 302, to convert the tensor feature map into a feature vector representing the target, adopt the Regional Maximum Activation of Convolutions (R-MAC) strategy and regard the H × W × D tensor feature map as D descriptors of dimension H × W. Local average pooling or maximum pooling is then applied to the D descriptors to obtain a D-dimensional feature representing the target.
Step 303, if one picture contains multiple targets, the D-dimensional features of the different targets can be concatenated through regional feature fusion to form the output.
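Steps 302 and 303 can be sketched as follows: the R-MAC-style pooling collapses each of the D channel maps to one value, and the descriptors of multiple targets are simply concatenated. This is a numpy sketch under the patent's H × W × D layout; function names are invented for illustration.

```python
import numpy as np

def pool_region_features(region, mode="max"):
    """Step 302: collapse an H x W x D tensor feature map into a D-dimensional
    descriptor by pooling each of the D channel maps to a single value."""
    pool = region.max if mode == "max" else region.mean
    return pool(axis=(0, 1))                       # shape (D,)

def fuse_targets(regions):
    """Step 303: concatenate the D-dim descriptors of several located targets."""
    return np.concatenate([pool_region_features(r) for r in regions])

region = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # H=2, W=3, D=4
desc = pool_region_features(region)
print(desc.shape)                                  # (4,)
print(fuse_targets([region, region]).shape)        # (8,)
```

Max pooling keeps the strongest activation per channel, which is what makes the descriptor robust when the target fills only part of the localized region; average pooling is the softer alternative the patent also permits.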
Step 4, obtain the user's query picture, extract its features with the methods of steps 2 and 3, calculate the Euclidean distance in the local feature space between the query features and the features of each category of silk cultural relic image, and select the category of silk cultural relic image whose features are closest to the query as the retrieval result.
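Step 4 reduces to a nearest-neighbour search under the Euclidean metric. A toy numpy sketch is given below; the category names and descriptor values are invented purely for illustration.

```python
import numpy as np

def retrieve(query_desc, gallery):
    """Return the gallery category whose descriptor has the smallest
    Euclidean distance to the query descriptor (the patent's step 4)."""
    return min(gallery, key=lambda k: np.linalg.norm(gallery[k] - query_desc))

# Hypothetical 3-dim descriptors for three silk categories.
gallery = {
    "brocade": np.array([1.0, 0.0, 0.0]),
    "embroidery": np.array([0.0, 1.0, 0.0]),
    "damask": np.array([0.0, 0.0, 1.0]),
}
query = np.array([0.9, 0.1, 0.0])
print(retrieve(query, gallery))  # brocade
```

In practice the descriptors would be the D-dimensional (or concatenated) R-MAC features from step 3, and the gallery would be indexed offline for the whole silk cultural relic image collection.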
Claims (3)
1. A silk cultural relic image retrieval method based on depth feature region fusion is characterized by comprising the following steps:
step 1, classifying and learning silk cultural relic images by adopting a deep learning global feature extraction mode, and classifying all silk cultural relic images into different categories;
step 2, selecting an activation area corresponding to each type of silk cultural relic image determined in the step 1 by adopting a neural network visualization mode, and further realizing target positioning, wherein the method comprises the following steps:
step 201, fusing the feature maps of the silk cultural relic images of each category determined in step 1 by using the Grad-CAM method to obtain a Grad-CAM map;
step 202, performing global average pooling on the Grad-CAM map of each category, namely taking the mean of the Grad-CAM map as its score, and retaining the Grad-CAM maps whose score exceeds a threshold, which indicates that they contain a target of the current category;
step 203, locating the specific position of the target of the corresponding category from the contour of the retained Grad-CAM map, thereby achieving target localization;
and 3, fusing the features related to the target region in a region feature fusion mode to be used as the local descriptor of the target, wherein the method comprises the following steps:
step 301, locating the detected target to obtain the convolution result within its localized region, an H × W × D tensor feature map, where H, W and D denote the height, width and number of channels of the feature map respectively;
step 302, adopting the Regional Maximum Activation of Convolutions (R-MAC) strategy: regarding the H × W × D tensor feature map as D descriptors of dimension H × W, and performing local average pooling or maximum pooling on the D descriptors to obtain a D-dimensional feature representing the target;
and step 4, obtaining the user's query picture, extracting its features with the methods of steps 2 and 3, calculating the Euclidean distance in the local feature space between the query features and the features of each category of silk cultural relic image, and selecting the category of silk cultural relic image whose features are closest to the query as the retrieval result.
2. The silk cultural relic image retrieval method based on depth feature region fusion according to claim 1, wherein in step 1, during classification learning, the pre-trained model is fine-tuned for classification on the target data by means of transfer learning.
3. The silk cultural relic image retrieval method based on depth feature region fusion according to claim 1, wherein in step 302, if a picture contains multiple targets, the D-dimensional features of the different targets are concatenated as the output through regional feature fusion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010498104.5A CN111737512B (en) | 2020-06-04 | 2020-06-04 | Silk cultural relic image retrieval method based on depth feature region fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111737512A CN111737512A (en) | 2020-10-02 |
CN111737512B (en) | 2021-11-12 |
Family
ID=72649012
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010498104.5A Active CN111737512B (en) | 2020-06-04 | 2020-06-04 | Silk cultural relic image retrieval method based on depth feature region fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111737512B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112837299B (en) * | 2021-02-09 | 2024-02-27 | 浙江工业大学 | Textile image fingerprint retrieval method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3166049A1 (en) * | 2015-11-03 | 2017-05-10 | Baidu USA LLC | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN109272011A (en) * | 2018-07-31 | 2019-01-25 | 东华大学 | Multitask depth representing learning method towards image of clothing classification |
CN110334746A (en) * | 2019-06-12 | 2019-10-15 | 腾讯科技(深圳)有限公司 | A kind of image detecting method and device |
CN110688511A (en) * | 2019-08-15 | 2020-01-14 | 深圳久凌软件技术有限公司 | Fine-grained image retrieval method and device, computer equipment and storage medium |
CN110825899A (en) * | 2019-09-18 | 2020-02-21 | 武汉纺织大学 | Clothing image retrieval method integrating color features and residual network depth features |
CN111104538A (en) * | 2019-12-06 | 2020-05-05 | 深圳久凌软件技术有限公司 | Fine-grained vehicle image retrieval method and device based on multi-scale constraint |
CN111159456A (en) * | 2019-12-30 | 2020-05-15 | 云南大学 | Multi-scale clothing retrieval method and system based on deep learning and traditional features |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909565B (en) * | 2018-09-14 | 2023-06-16 | 阿里巴巴集团控股有限公司 | Image recognition and pedestrian re-recognition method and device, electronic and storage equipment |
US11170257B2 (en) * | 2018-10-15 | 2021-11-09 | Ancestry.Com Operations Inc. | Image captioning with weakly-supervised attention penalty |
CN111177446B (en) * | 2019-12-12 | 2023-04-25 | 苏州科技大学 | Method for searching footprint image |
CN111177376B (en) * | 2019-12-17 | 2023-08-15 | 东华大学 | Chinese text classification method based on BERT and CNN hierarchical connection |
CN111104539A (en) * | 2019-12-20 | 2020-05-05 | 湖南千视通信息科技有限公司 | Fine-grained vehicle image retrieval method, device and equipment |
- 2020-06-04: CN application CN202010498104.5A, patent CN111737512B (en), status Active
Non-Patent Citations (2)
Title |
---|
Clothes Keypoints Detection with Cascaded Pyramid Network; Li Chao, et al.; Journal of Donghua University; 2020-03-31; Vol. 37, No. 3, pp. 232-236 *
Research progress on feature extraction and retrieval of fabric images based on convolutional neural networks; Sun Jie, et al.; Journal of Textile Research; December 2019; Vol. 40, No. 12, pp. 1345-1353 *
Also Published As
Publication number | Publication date |
---|---|
CN111737512A (en) | 2020-10-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109949317B (en) | Semi-supervised image example segmentation method based on gradual confrontation learning | |
Chen et al. | Improved saliency detection in RGB-D images using two-phase depth estimation and selective deep fusion | |
CN109598241B (en) | Satellite image marine ship identification method based on Faster R-CNN | |
JP6440303B2 (en) | Object recognition device, object recognition method, and program | |
Yap et al. | A comparative study of mobile-based landmark recognition techniques | |
US9025863B2 (en) | Depth camera system with machine learning for recognition of patches within a structured light pattern | |
Yan et al. | TrAdaBoost based on improved particle swarm optimization for cross-domain scene classification with limited samples | |
CN107291936A (en) | The hypergraph hashing image retrieval of a kind of view-based access control model feature and sign label realizes that Lung neoplasm sign knows method for distinguishing | |
CN110363071A (en) | A kind of sea ice detection method cooperateing with Active Learning and transductive SVM | |
Chen et al. | Integrated content and context analysis for mobile landmark recognition | |
Qian et al. | On combining social media and spatial technology for POI cognition and image localization | |
CN102867192B (en) | A kind of Scene Semantics moving method propagated based on supervision geodesic line | |
CN113159043A (en) | Feature point matching method and system based on semantic information | |
Liao et al. | Tag features for geo-aware image classification | |
CN114332921A (en) | Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network | |
JP4926266B2 (en) | Learning data creation device, learning data creation method and program | |
CN111737512B (en) | Silk cultural relic image retrieval method based on depth feature region fusion | |
Chen et al. | Human motion target posture detection algorithm using semi-supervised learning in internet of things | |
CN112446431A (en) | Feature point extraction and matching method, network, device and computer storage medium | |
Chen et al. | Correlation filter tracking via distractor-aware learning and multi-anchor detection | |
CN107578003A (en) | A kind of remote sensing images transfer learning method based on GEOGRAPHICAL INDICATION image | |
Liao et al. | Multi-scale saliency features fusion model for person re-identification | |
CN111144466B (en) | Image sample self-adaptive depth measurement learning method | |
CN116994034A (en) | Small target detection algorithm based on feature pyramid | |
Li et al. | A Sparse Feature Matching Model Using a Transformer towards Large‐View Indoor Visual Localization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |