CN117891964A - Cross-modal image retrieval method based on feature aggregation - Google Patents

Cross-modal image retrieval method based on feature aggregation

Info

Publication number
CN117891964A
Authority
CN
China
Prior art keywords
footprint
image
gray
dust
layer
Prior art date
Legal status
Pending
Application number
CN202410059094.3A
Other languages
Chinese (zh)
Inventor
张艳 (Zhang Yan)
吴红英 (Wu Hongying)
王年 (Wang Nian)
汪思彤 (Wang Sitong)
严毅 (Yan Yi)
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202410059094.3A
Publication of CN117891964A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal image retrieval method based on feature aggregation, which comprises the following steps: processing the acquired footprint images with a CPU; feeding the footprint data set into a multi-stage feature aggregation network for optimization and loading the gray-scale footprint images of the search library; acquiring a dust footprint image to be queried; calculating the similarity between the dust footprint image to be queried and the gray-scale footprint images in the search library; and outputting the personal information associated with the gray-scale footprint image in the search library that is most similar to the dust footprint image to be queried. The invention relates to the field of image processing, and in particular to a cross-modal image retrieval method based on feature aggregation, which effectively reduces the modal difference between dust footprints and gray-scale footprints and improves the accuracy of cross-modal footprint image retrieval.

Description

Cross-modal image retrieval method based on feature aggregation
Technical Field
The invention relates to the field of image processing, and in particular to a cross-modal image retrieval method based on feature aggregation.
Background
At present, gray-scale footprint databases are established for criminal suspects: dust footprints left at a scene are compared against the footprint database, and the result depends heavily on expert experience. Manual comparison also consumes considerable human resources and time; its efficiency is low and its practical effect is unsatisfactory. Therefore, to improve comparison efficiency and accuracy, a cross-modal image retrieval method between dust footprints and gray-scale footprints is needed.
However, the dust footprint and gray-scale footprint modalities exhibit a large modal difference, while footprint images of different subjects within the same modality can look similar. In addition, the cluttered background of dust footprints hinders the extraction of highly representative features. These factors make cross-modal retrieval of footprint images challenging, so extracting highly representative features of the same subject across modalities is of great help to footprint retrieval.
Disclosure of Invention
(I) Technical problems to be solved
In view of the deficiencies of the prior art, the invention provides a cross-modal image retrieval method based on feature aggregation, which effectively reduces the modal difference between dust footprints and gray-scale footprints and improves the accuracy of cross-modal footprint image retrieval.
(II) Technical scheme
In order to achieve the above purpose, the invention is realized by the following technical scheme: a cross-modal image retrieval method based on feature aggregation comprises the following steps:
S1: processing the acquired footprint images with a CPU;
S2: feeding the footprint data set into a multi-stage feature aggregation network for optimization, and loading the gray-scale footprint images of the search library;
S3: acquiring a dust footprint image to be queried;
S4: calculating the similarity between the dust footprint image to be queried and the gray-scale footprint images in the search library;
S5: outputting the personal information associated with the gray-scale footprint image in the search library that is most similar to the dust footprint image to be queried;
wherein step S2 comprises the following steps:
First, preprocess the acquired footprint images: barefoot dust footprint images are collected by photographing footprints in the field environment, and barefoot gray-scale footprint images are collected with an optical sensor; image segmentation is applied to the dust footprints to separate the bare footprint from the background, yielding a dust footprint data set suitable for extracting highly representative features; all footprint images are resized to 512×512, and data augmentation is applied to the gray-scale footprint data set;
Second, train the network model: the footprint images of seven of the nine subjects are input into the network model as training samples, the footprint images of the remaining two subjects serve as test samples, and a search library is built from the gray-scale footprint images;
The specific steps for training the network model in the second step are as follows:
(1) Construct a hybrid attention module: ResNet is adopted as the backbone network, and the inputs of the hybrid attention module are the low-level feature map before a backbone layer and the high-level feature map after that layer. First, the low-level and high-level feature maps are each fed into a 1×1 convolution layer, and the outputs of these two convolution layers are combined by matrix multiplication and a softmax function to compute a channel similarity matrix. The low-level feature map is then passed through another 1×1 convolution layer and multiplied with the channel similarity matrix to enhance the channel feature representation. Finally, a 1×1 convolution layer converts the feature to the size of the original high-level feature map, and the two are added to obtain the output; the spatial feature representation is enhanced by a similar operation on the low-level feature map;
(2) Construct the feature aggregation module: hybrid attention modules are fused between the layers of the backbone network to form the feature aggregation module, and the Layer4 stage is not used. One hybrid attention module is fused after Layer1 and two after Layer2; the first two modules take the feature maps before and after the corresponding layer as their low-level and high-level inputs, while the last fused hybrid attention module takes the original feature before Layer1 as its low-level input and the output of the preceding hybrid attention module as its high-level input;
(3) Construct a partial attention module: a partial attention module focusing on fine-grained part features is added after the backbone network. The features after Layer3 are divided into 3 non-overlapping parts by an adaptive average pooling function and fed into three 1×1 convolution layers. The outputs of the first two convolution layers are combined by matrix multiplication and normalized with a softmax activation function to obtain a weight matrix; the output of the third convolution layer, a fine-grained part feature, is weighted and summed with this weight matrix to obtain attention-enhanced part features. In parallel, the input features pass through a global average pooling layer and a batch normalization layer to obtain a feature vector, and the two features are added to obtain the output feature.
Preferably, in step S3, the trained network model and a metric function are used to calculate the Manhattan distance between the dust footprint image to be queried and each gray-scale footprint image in the search library, and this distance measures their similarity.
Preferably, when the footprint images are acquired in the first step, each subject contributes 42 barefoot dust footprint images and 6 barefoot gray-scale footprint images.
Preferably, the augmentation of the gray-scale footprint data in the first step includes horizontal flipping, clockwise rotation by 10°, and counterclockwise rotation by 10°.
(III) Beneficial effects
The invention provides a cross-modal image retrieval method based on feature aggregation, which has the following beneficial effects:
The invention extracts highly representative features from barefoot dust footprint images and barefoot gray-scale footprint images, and uses deep learning to make cross-modal footprint retrieval intelligent. The method extracts features and computes similarity at low cost, addressing problems such as the cross-modal difference between footprint images. Compared with manual retrieval, it improves comparison efficiency and accuracy to a certain extent and effectively realizes cross-modal image retrieval between dust footprints and gray-scale footprints. The invention is of positive significance for dust footprint comparison and identification by means of artificial intelligence.
Drawings
FIG. 1 is a flow chart of an image retrieval method of the present invention;
FIG. 2 is a framework diagram of network optimization and cross-modal retrieval in accordance with the present invention;
FIG. 3 is a block diagram of a feature aggregation module of the present invention;
FIG. 4 is a framework diagram of the hybrid attention module of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to FIGS. 1-4, the present invention provides a technical solution:
As shown in FIG. 2, network optimization and cross-modal retrieval are completed through the following steps:
First, preprocess the acquired footprint images: barefoot dust footprint images are collected by photographing footprints in the field environment, and barefoot gray-scale footprint images are collected with an optical sensor, each subject contributing 42 barefoot dust footprint images and 6 barefoot gray-scale footprint images. Image segmentation is applied to the dust footprints to separate the bare footprint from the background, yielding a dust footprint data set suitable for extracting highly representative features. All footprint images are resized to 512×512, and the gray-scale footprint data set is augmented by horizontal flipping, clockwise rotation by 10°, and counterclockwise rotation by 10°, as sketched below;
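A minimal preprocessing sketch in PyTorch/torchvision, assuming offline augmentation; the patent specifies only the 512×512 resize, the horizontal flip, and the ±10° rotations, so everything else here (names, tensor conversion) is illustrative:

```python
from torchvision import transforms
from torchvision.transforms import functional as TF

# Applied to every footprint image of both modalities.
base_transform = transforms.Compose([
    transforms.Resize((512, 512)),   # unify all footprint images to 512x512
    transforms.ToTensor(),
])

def augment_gray(img):
    """Expand one gray-scale footprint into the four variants named above.
    Note: a positive angle in torchvision rotates counter-clockwise."""
    return [img,
            TF.hflip(img),           # horizontal flip
            TF.rotate(img, -10),     # clockwise rotation by 10 degrees
            TF.rotate(img, 10)]      # counterclockwise rotation by 10 degrees
```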
Second, train the network model: the footprint images of seven of the nine subjects are input into the network model as training samples, the footprint images of the remaining two subjects serve as test samples, and a search library is built from the gray-scale footprint images. At test time, the dust footprint image to be queried is input into the network, which extracts query features and compares them with the features in the search library; the similarity is determined by the Manhattan distance, a smaller distance indicating greater similarity, and the prediction accuracy is finally reported.
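The patent does not state the training objective. Purely as a hedged illustration, the following sketch pairs an identity classification loss with a triplet loss, a common combination in cross-modal retrieval networks; `model`, `classifier`, and the batch layout are assumptions, not the authors' method:

```python
import torch
import torch.nn as nn

def train_step(model, classifier, optimizer, images, labels):
    """One hypothetical optimization step. Assumes `model` ends in a head that
    yields one embedding per image, and that the loader packs each batch as
    equal thirds of anchors, positives, and negatives."""
    feats = model(images)                              # (B, D) embeddings
    id_loss = nn.functional.cross_entropy(classifier(feats), labels)
    anchor, positive, negative = feats.chunk(3)
    tri_loss = nn.functional.triplet_margin_loss(anchor, positive, negative,
                                                 margin=0.3)
    loss = id_loss + tri_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```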
As shown in FIG. 4, the specific steps for training the network model are as follows:
(1) Construct a hybrid attention module: ResNet is adopted as the backbone network, and the inputs of the hybrid attention module are the low-level feature map before a backbone layer and the high-level feature map after that layer. First, the low-level and high-level feature maps are each fed into a 1×1 convolution layer, and the outputs of these two convolution layers are combined by matrix multiplication and a softmax function to compute a channel similarity matrix. The low-level feature map is then passed through another 1×1 convolution layer and multiplied with the channel similarity matrix to enhance the channel feature representation. Finally, a 1×1 convolution layer converts the feature to the size of the original high-level feature map, and the two are added to obtain the output; the spatial feature representation is enhanced by a similar operation on the low-level feature map (see the HybridAttention sketch after step (3));
(2) Construct the feature aggregation module: as shown in FIG. 3, hybrid attention modules are fused between the layers of the backbone network to form the feature aggregation module, and the Layer4 stage is not used. One hybrid attention module is added after Layer1, taking the feature maps before and after Layer1 as its inputs; two hybrid attention modules are added after Layer2, the first taking the feature maps before and after Layer2, while the last fused module takes the original feature before Layer1 as its low-level input and the output of the preceding hybrid attention module as its high-level input (see the wiring sketch after step (3));
(3) Construct a partial attention module: a partial attention module focusing on fine-grained part features is added after the backbone network. The features after Layer3 are divided into 3 non-overlapping parts by an adaptive average pooling function and fed into three 1×1 convolution layers. The outputs of the first two convolution layers are combined by matrix multiplication and normalized with a softmax activation function to obtain a weight matrix; the output of the third convolution layer, a fine-grained part feature, is weighted and summed with this weight matrix to obtain attention-enhanced part features. In parallel, the input features pass through a global average pooling layer and a batch normalization layer to obtain a feature vector, and the two features are added to obtain the output feature (a sketch also follows below);
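The three modules above can be illustrated with short PyTorch sketches. These are minimal readings of the text rather than the authors' code: channel counts assume a ResNet-50 backbone, and all class, function, and parameter names (HybridAttention, mid_ch, and so on) are illustrative. First, the hybrid attention module of step (1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttention(nn.Module):
    """Channel branch of the hybrid attention module: a channel-similarity
    matrix built from the low- and high-level maps re-weights the low-level
    features, and the result is added back to the high-level map. The spatial
    branch (omitted) would mirror this with similarity over spatial positions."""
    def __init__(self, low_ch, high_ch, mid_ch=64):
        super().__init__()
        self.q = nn.Conv2d(low_ch, mid_ch, 1)     # 1x1 conv on the low-level map
        self.k = nn.Conv2d(high_ch, mid_ch, 1)    # 1x1 conv on the high-level map
        self.v = nn.Conv2d(low_ch, mid_ch, 1)     # 1x1 conv producing the values
        self.out = nn.Conv2d(mid_ch, high_ch, 1)  # restore the high-level size

    def forward(self, low, high):
        b, _, h, w = high.shape
        low = F.interpolate(low, size=(h, w), mode='bilinear', align_corners=False)
        q = self.q(low).flatten(2)                                    # (B, C, N)
        k = self.k(high).flatten(2)                                   # (B, C, N)
        sim = torch.softmax(torch.bmm(q, k.transpose(1, 2)), dim=-1)  # (B, C, C)
        v = self.v(low).flatten(2)
        enhanced = torch.bmm(sim, v).view(b, -1, h, w)
        return self.out(enhanced) + high   # residual addition to the high-level map
```

Next, one possible wiring for step (2), reusing HybridAttention (the interpolation inside the module absorbs the differing spatial sizes between stages):

```python
class FeatureAggregationNet(nn.Module):
    """ResNet stages Layer1-Layer3 (Layer4 unused) with one hybrid attention
    module after Layer1 and two after Layer2; the last module takes the
    pre-Layer1 feature as its low-level input."""
    def __init__(self, backbone):
        super().__init__()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2, self.layer3 = (
            backbone.layer1, backbone.layer2, backbone.layer3)
        self.ha1 = HybridAttention(64, 256)    # maps before/after Layer1
        self.ha2 = HybridAttention(256, 512)   # maps before/after Layer2
        self.ha3 = HybridAttention(64, 512)    # pre-Layer1 map vs. ha2 output

    def forward(self, x):
        x0 = self.stem(x)                      # original feature before Layer1
        x1 = self.ha1(x0, self.layer1(x0))
        x2 = self.ha2(x1, self.layer2(x1))
        x2 = self.ha3(x0, x2)                  # last module fuses the earliest feature
        return self.layer3(x2)
```

Finally, one plausible reading of the partial attention head of step (3); the exact tensor layout is not specified, so the 3 parts are taken here as horizontal stripes of the Layer3 map:

```python
class PartialAttention(nn.Module):
    def __init__(self, ch=1024, parts=3):      # ch = ResNet-50 Layer3 output channels
        super().__init__()
        self.parts = parts
        self.q = nn.Conv2d(ch, ch, 1)          # the three 1x1 convolution layers
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.bn = nn.BatchNorm1d(ch)

    def forward(self, x):
        stripes = F.adaptive_avg_pool2d(x, (self.parts, 1))  # 3 non-overlapping parts
        q = self.q(stripes).flatten(2)                       # (B, C, 3)
        k = self.k(stripes).flatten(2)
        v = self.v(stripes).flatten(2)
        w = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, 3, 3) weights
        part_feat = torch.bmm(v, w).mean(dim=2)              # weighted sum -> (B, C)
        global_feat = self.bn(F.adaptive_avg_pool2d(x, 1).flatten(1))  # GAP + BN branch
        return part_feat + global_feat                       # sum gives the output feature
```

As a usage check, FeatureAggregationNet(torchvision.models.resnet50(weights=None)) applied to a 1×3×512×512 input yields a 1×1024×32×32 map, which PartialAttention reduces to a 1×1024 embedding.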
Third, cross-modal footprint retrieval: to find the subject in the footprint search library that matches a footprint image of unknown identity, the query footprint image is first provided to the trained network model. The Manhattan distance between the query image and each image in the search library is then calculated with the model's metric function, and this distance serves as the basis for measuring the similarity between the image to be detected and the images in the search library. Finally, the personal information of the closest match is returned to the user as the output, which helps find possible matching subjects in footprint comparison scenarios.
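A minimal sketch of this retrieval step, assuming the trained network yields one feature vector per image (all names are illustrative):

```python
import torch

def retrieve(query_feat, gallery_feats, gallery_ids):
    """Rank the gray-scale gallery by Manhattan (L1) distance to the query
    dust-footprint feature; the smallest distance is the best match."""
    # query_feat: (D,), gallery_feats: (N, D)
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats, p=1).squeeze(0)
    best = int(torch.argmin(dists))
    return gallery_ids[best], float(dists[best])   # matched person and its distance
```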
Based on the above steps, as shown in FIG. 1, the cross-modal image retrieval method based on feature aggregation specifically comprises the following operation steps:
S1: processing the acquired footprint images with a CPU;
S2: feeding the footprint data set into a multi-stage feature aggregation network for optimization, and loading the gray-scale footprint images of the search library;
S3: acquiring a dust footprint image to be queried;
S4: calculating the similarity between the dust footprint image to be queried and the gray-scale footprint images in the search library;
S5: outputting the personal information associated with the gray-scale footprint image in the search library that is most similar to the dust footprint image to be queried.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. A cross-modal image retrieval method based on feature aggregation, characterized by comprising the following steps:
S1: processing the acquired footprint images with a CPU;
S2: feeding the footprint data set into a multi-stage feature aggregation network for optimization, and loading the gray-scale footprint images of the search library;
S3: acquiring a dust footprint image to be queried;
S4: calculating the similarity between the dust footprint image to be queried and the gray-scale footprint images in the search library;
S5: outputting the personal information associated with the gray-scale footprint image in the search library that is most similar to the dust footprint image to be queried;
wherein step S2 comprises the following steps:
First, preprocess the acquired footprint images: barefoot dust footprint images are collected by photographing footprints in the field environment, and barefoot gray-scale footprint images are collected with an optical sensor; image segmentation is applied to the dust footprints to separate the bare footprint from the background, yielding a dust footprint data set suitable for extracting highly representative features; all footprint images are resized to 512×512, and data augmentation is applied to the gray-scale footprint data set;
Second, train the network model: the footprint images of seven of the nine subjects are input into the network model as training samples, the footprint images of the remaining two subjects serve as test samples, and a search library is built from the gray-scale footprint images;
The specific steps for training the network model in the second step are as follows:
(1) Construct a hybrid attention module: ResNet is adopted as the backbone network, and the inputs of the hybrid attention module are the low-level feature map before a backbone layer and the high-level feature map after that layer. First, the low-level and high-level feature maps are each fed into a 1×1 convolution layer, and the outputs of these two convolution layers are combined by matrix multiplication and a softmax function to compute a channel similarity matrix. The low-level feature map is then passed through another 1×1 convolution layer and multiplied with the channel similarity matrix to enhance the channel feature representation. Finally, a 1×1 convolution layer converts the feature to the size of the original high-level feature map, and the two are added to obtain the output; the spatial feature representation is enhanced by a similar operation on the low-level feature map;
(2) Construct the feature aggregation module: hybrid attention modules are fused between the layers of the backbone network to form the feature aggregation module, and the Layer4 stage is not used. One hybrid attention module is fused after Layer1 and two after Layer2; the first two modules take the feature maps before and after the corresponding layer as their low-level and high-level inputs, while the last fused hybrid attention module takes the original feature before Layer1 as its low-level input and the output of the preceding hybrid attention module as its high-level input;
(3) Construct a partial attention module: a partial attention module focusing on fine-grained part features is added after the backbone network. The features after Layer3 are divided into 3 non-overlapping parts by an adaptive average pooling function and fed into three 1×1 convolution layers. The outputs of the first two convolution layers are combined by matrix multiplication and normalized with a softmax activation function to obtain a weight matrix; the output of the third convolution layer, a fine-grained part feature, is weighted and summed with this weight matrix to obtain attention-enhanced part features. In parallel, the input features pass through a global average pooling layer and a batch normalization layer to obtain a feature vector, and the two features are added to obtain the output feature.
2. The cross-modal image retrieval method based on feature aggregation as claimed in claim 1, wherein: in step S3, the cross-modal retrieval function provides the footprint image to be queried to the trained network model, calculates the Manhattan distance between the dust footprint image to be queried and each gray-scale footprint image in the search library using the trained network model and a metric function, and uses this distance to measure their similarity.
3. The cross-modal image retrieval method based on feature aggregation as claimed in claim 1, wherein: when the footprint images are acquired in the first step, each subject contains 42 barefoot dust footprint images and 6 barefoot gray-scale footprint images.
4. The cross-modal image retrieval method based on feature aggregation as claimed in claim 1, wherein: the augmentation of the gray-scale footprint data in the first step includes horizontal flipping, clockwise rotation by 10°, and counterclockwise rotation by 10°.
CN202410059094.3A 2024-01-16 2024-01-16 Cross-modal image retrieval method based on feature aggregation Pending CN117891964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410059094.3A CN117891964A (en) 2024-01-16 2024-01-16 Cross-modal image retrieval method based on feature aggregation


Publications (1)

Publication Number Publication Date
CN117891964A 2024-04-16

Family

ID=90646972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410059094.3A Pending CN117891964A (en) 2024-01-16 2024-01-16 Cross-modal image retrieval method based on feature aggregation

Country Status (1)

Country Link
CN (1) CN117891964A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948165A (en) * 2019-04-24 2019-06-28 吉林大学 Fine granularity feeling polarities prediction technique based on mixing attention network
CN111444889A (en) * 2020-04-30 2020-07-24 南京大学 Fine-grained action detection method of convolutional neural network based on multi-stage condition influence
CN112257601A (en) * 2020-10-22 2021-01-22 福州大学 Fine-grained vehicle identification method based on data enhancement network of weak supervised learning
CN113868449A (en) * 2021-09-22 2021-12-31 西安理工大学 Image retrieval method based on fusion of multi-scale features and spatial attention mechanism
CN115331141A (en) * 2022-08-03 2022-11-11 天津大学 High-altitude smoke and fire detection method based on improved YOLO v5
WO2023273290A1 (en) * 2021-06-29 2023-01-05 山东建筑大学 Object image re-identification method based on multi-feature information capture and correlation analysis



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination