CN110019652B - Cross-modal Hash retrieval method based on deep learning - Google Patents


Info

Publication number
CN110019652B
CN110019652B CN201910196009.7A
Authority
CN
China
Prior art keywords
theta
retrieval
text
modality
modal
Prior art date
Legal status
Active
Application number
CN201910196009.7A
Other languages
Chinese (zh)
Other versions
CN110019652A (en)
Inventor
董西伟
邓安远
周军
杨茂保
孙丽
胡芳
贾海英
王海霞
Current Assignee
Jiujiang University
Original Assignee
Jiujiang University
Priority date
Filing date
Publication date
Application filed by Jiujiang University filed Critical Jiujiang University
Priority to CN201910196009.7A priority Critical patent/CN110019652B/en
Publication of CN110019652A publication Critical patent/CN110019652A/en
Application granted granted Critical
Publication of CN110019652B publication Critical patent/CN110019652B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/325: Hash tables
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51: Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Cross-modal hash retrieval method based on deep learning. Assume the set of pixel feature vectors of the image modality of n objects is V = {v_i}_{i=1}^n. The method is characterized by comprising the following steps: (1) obtain the binary hash codes B shared by the image modality and the text modality using an objective function designed on the basis of deep learning techniques, together with the deep neural network parameters θ_v and θ_t of the image and text modalities and the projection matrices P_v and P_t of the image and text modalities; (2) solve the unknown variables B, θ_v, θ_t, P_v and P_t in the objective function in an alternating-update manner; (3) based on the solved deep neural network parameters θ_v and θ_t of the image and text modalities and the projection matrices P_v and P_t, generate binary hash codes for the query samples and the samples in the retrieval sample set; (4) compute the Hamming distance from the query sample to each sample in the retrieval sample set based on the generated binary hash codes; (5) complete the retrieval of the query sample using a cross-modal retriever based on approximate nearest neighbor search. The method effectively improves the performance of cross-modal hash retrieval.

Description

Cross-modal Hash retrieval method based on deep learning
Technical Field
The invention relates to a cross-modal Hash retrieval method based on deep learning.
Background
With the rapid development of science, technology, and social productivity, the era of big data has quietly arrived. Big data refers to collections of data that cannot be captured, managed, and processed with conventional software tools within a given time frame. IBM proposed that big data has the 5V characteristics: Volume (large data volume), Variety (diverse types and sources), Value (relatively low value density, though sometimes highly valuable), Velocity (rapid data growth), and Veracity (data quality). Big data can also be regarded as an information asset that requires new processing modes to deliver stronger decision-making, insight-discovery, and process-optimization capabilities.
Information retrieval is an important aspect of data processing, and in the face of big data, performing retrieval effectively has become an urgent and very challenging problem. For large-scale data retrieval, hash-based methods play an important role. A hash retrieval method maps the high-dimensional features of an object into Hamming space, generating a low-dimensional hash code to represent the object; this reduces the memory required by the retrieval system, improves retrieval speed, and better meets the demands of massive-scale retrieval. The main idea of hash retrieval is to project data represented by high-dimensional vectors into Hamming space and search for the K nearest neighbors (K ≥ 1) there. To keep the K nearest neighbors found in Hamming space consistent with those in the original space, a hash learning algorithm must satisfy a locality-preserving property, i.e., similarity must be maintained before and after the projection. With the Locality-Sensitive Hashing (LSH) method, two points that are very close in the high-dimensional space have, after being hash-coded by a hash function, a very high probability of receiving the same hash code, whereas two distant points have only a small probability of colliding.
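As a concrete illustration of this locality-preserving idea, the following Python sketch implements the classic random-hyperplane variant of LSH (shown here only to illustrate the principle; the hash functions of the invention are learned, as described later). Nearby points agree on most bits, while unrelated points disagree on roughly half of them:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 32                               # feature dimension, hash code length
hyperplanes = rng.standard_normal((d, k))

def lsh_code(x):
    """One bit per random hyperplane: which side of the plane x falls on."""
    return (x @ hyperplanes >= 0).astype(np.uint8)

x = rng.standard_normal(d)
x_near = x + 0.01 * rng.standard_normal(d)   # a very close point
x_far = rng.standard_normal(d)               # an unrelated point

# Hamming distance = number of differing bits.
print(np.sum(lsh_code(x) != lsh_code(x_near)))   # small with high probability
print(np.sum(lsh_code(x) != lsh_code(x_far)))    # close to k/2 on average
```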
Cross-modal hash retrieval mainly addresses mutual retrieval between data of different modalities, for example retrieving texts with images or retrieving images with texts. A cross-modal hashing method must hash-encode the data of the different modalities into compact binary codes and then complete the mutual retrieval between modalities based on the generated codes. Ding et al. proposed the Collective Matrix Factorization Hashing (CMFH) method, which uses collective matrix factorization to learn a unified hash code from the different modalities of each instance. To use category information effectively and maintain the local geometric structure within a matrix-factorization-based cross-modal hashing method, and thereby improve the discriminative power of the latent semantic features obtained by factorization, Tang et al. proposed the Supervised Matrix Factorization Hashing (SMFH) method. When learning hash codes, SMFH considers both the consistency of label information across modalities and the consistency of the local geometric structure within each modality. To address the high training-time complexity of supervised cross-modal hashing, Zhang et al. proposed a supervised method called Semantic Correlation Maximization (SCM), which can seamlessly integrate semantic label information into the hash learning process.
The hand-crafted features used by cross-modal hashing algorithms with shallow learning structures may not be optimally compatible with hash code learning. To address this problem, Jiang et al. proposed the Deep Cross-Modal Hashing (DCMH) method, an end-to-end cross-modal hashing method built on a deep learning framework that effectively integrates feature learning and hash code learning in a single framework. To improve the quantizability of deep feature representations in an end-to-end learning architecture, Cao et al. introduced quantization into end-to-end deep learning for cross-modal retrieval and proposed the Collective Deep Quantization (CDQ) method. CDQ jointly learns deep feature representations and quantizers for both modalities through a carefully designed hybrid network and loss function. The hybrid network of CDQ comprises: an image network composed of several Convolution-Pooling layers for extracting image feature representations, a text network composed of several Fully-Connected layers for extracting text feature representations, two Fully-Connected Bottleneck layers for generating optimal low-dimensional feature representations, an adaptive cross-entropy loss for capturing cross-modal correlations, and a collective quantization loss for controlling hash quality and quantizability. In addition, CDQ learns a quantizer codebook shared by the modalities, which substantially strengthens the association between them. To capture the essential relationships between different modalities in an end-to-end deep architecture for cross-modal retrieval, Yang et al. proposed the Pairwise Relationship-oriented Deep Hashing (PRDH) method. PRDH integrates different types of pairwise constraints from both the intra-modal and inter-modal viewpoints to learn hash codes that better reflect the intrinsic relationships between modalities, and it further enhances the discriminative ability of each hash bit by introducing decorrelation constraints into the deep learning architecture.
Cross-modal hash retrieval requires mapping the high-dimensional features of objects in different modalities into a low-dimensional Hamming space, so that binary hash codes in that space can complete cross-modal retrieval tasks quickly and accurately. Most existing cross-modal hash retrieval methods are based on shallow learning structures; although such methods can complete retrieval tasks quickly using hashing techniques, their shallow structures cannot fully mine the discriminative information in the original features. Deep learning has shown excellent feature learning ability in tasks such as classification and object detection, and existing deep cross-modal hashing methods likewise show that deep learning helps improve cross-modal retrieval performance. Designing a cross-modal hash retrieval method based on deep learning is therefore of great significance and value for completing cross-modal retrieval tasks in the big data setting.
Disclosure of Invention
The invention aims to provide a cross-modal hash retrieval method based on deep learning, solving the problem that existing cross-modal hash retrieval methods based on shallow learning structures cannot adequately mine the discriminative information in the original features.
In order to achieve the above purpose, the adopted technical scheme is a cross-modal hash retrieval method based on deep learning. Assume the set of pixel feature vectors of the image modality of n objects is V = {v_i}_{i=1}^n, where v_i denotes the pixel feature vector of the i-th object in the image modality; let T = {t_i}_{i=1}^n denote the feature vectors of the n objects in the text modality, where t_i denotes the feature vector of the i-th object in the text modality; denote the class label vectors of the n objects as Y = [y_1, y_2, ..., y_n] ∈ {0,1}^{c×n}, where c denotes the number of object categories; for a vector y_i, if the i-th object belongs to the k-th class, the k-th element of y_i is set to 1, and otherwise it is set to 0. The method comprises the following steps:
(1) Obtain the binary hash codes B shared by the image modality and the text modality using an objective function designed on the basis of deep learning techniques, together with the deep neural network parameters θ_v and θ_t of the image and text modalities and the projection matrices P_v and P_t of the image and text modalities;
(2) solve the unknown variables B, θ_v, θ_t, P_v and P_t in the objective function in an alternating-update manner, i.e., alternately solve the following three sub-problems: fix B, P_v and P_t, and solve for θ_v and θ_t; fix B, θ_v and θ_t, and solve for P_v and P_t; fix θ_v, θ_t, P_v and P_t, and solve for B;
(3) based on the solved deep neural network parameters θ_v and θ_t of the image and text modalities and the projection matrices P_v and P_t, generate binary hash codes for the query samples and the samples in the retrieval sample set;
(4) compute the Hamming distance from the query sample to each sample in the retrieval sample set based on the generated binary hash codes;
(5) complete the retrieval of the query sample using a cross-modal retriever based on approximate nearest neighbor search.
The objective function designed on the basis of deep learning techniques in step (1) has the following form:

$$\min_{B,\theta_v,\theta_t,P_v,P_t}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \mathrm{tr}(B^{T}LB) + \gamma_1\big(\|P_v\|_F^2 + \|P_t\|_F^2\big) + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big),\quad \mathrm{s.t.}\ B \in \{-1,+1\}^{n \times k} \quad (1)$$

where γ_1 and γ_2 are non-negative balance factors, B = [b_1, b_2, ..., b_n]^T ∈ {-1,+1}^{n×k}, P_v ∈ R^{d^(v)×k} and P_t ∈ R^{d^(t)×k} are the projection matrices, θ_v and θ_t are the deep neural network parameters, F = [f(v_1;θ_v), ..., f(v_n;θ_v)] and G = [g(t_1;θ_t), ..., g(t_n;θ_t)] are the depth features of the n objects in the image modality and the text modality, respectively (the i-th columns of the matrices F and G are f(v_i;θ_v) and g(t_i;θ_t)), L is a Laplacian matrix for preserving intra-modal consistency and inter-modal consistency, 1 is a column vector whose elements are all 1, ||·||_F denotes the Frobenius norm of a matrix, tr(·) denotes the trace of a matrix, and (·)^T denotes the transpose of a matrix.
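For reference, the following numpy sketch evaluates the objective in equation (1) as reconstructed above; the dimension conventions (F: d^(v) × n, G: d^(t) × n, B: n × k) are assumptions taken from the notation of the text:

```python
import numpy as np

def objective(B, F, G, Pv, Pt, L, gamma1, gamma2):
    """Value of equation (1): quantization fit + consistency + regularization
    + bit balance. B is the shared code matrix in {-1, +1}^(n x k)."""
    one = np.ones((F.shape[1], 1))
    fit = (np.linalg.norm(B - F.T @ Pv, "fro") ** 2
           + np.linalg.norm(B - G.T @ Pt, "fro") ** 2)
    consistency = np.trace(B.T @ L @ B)          # intra- and inter-modal
    regular = gamma1 * (np.linalg.norm(Pv, "fro") ** 2
                        + np.linalg.norm(Pt, "fro") ** 2)
    balance = gamma2 * (np.linalg.norm(Pv.T @ F @ one) ** 2
                        + np.linalg.norm(Pt.T @ G @ one) ** 2)
    return fit + consistency + regular + balance
```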
In step (2), the unknown variables B, θ_v, θ_t, P_v and P_t in the objective function are solved in an alternating-update manner; specifically, the following three sub-problems are solved alternately:

(1) Fix B, P_v and P_t, and solve for θ_v and θ_t. When the binary hash codes B and the projection matrices P_v and P_t are fixed, the objective function in equation (1) reduces to a sub-problem with respect to the deep neural network parameters θ_v and θ_t:

$$\min_{\theta_v,\theta_t}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big) \quad (2)$$

(2) Fix B, θ_v and θ_t, and solve for P_v and P_t. When the binary hash codes B and the deep neural network parameters θ_v and θ_t are fixed, the objective function in equation (1) reduces to a sub-problem with respect to the projection matrices P_v and P_t:

$$\min_{P_v,P_t}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \gamma_1\big(\|P_v\|_F^2 + \|P_t\|_F^2\big) + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big) \quad (3)$$

(3) Fix θ_v, θ_t, P_v and P_t, and solve for B. When the deep neural network parameters θ_v and θ_t and the projection matrices P_v and P_t are fixed, the objective function in equation (1) reduces to a sub-problem with respect to the binary hash codes B:

$$\min_{B}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \mathrm{tr}(B^{T}LB),\quad \mathrm{s.t.}\ B \in \{-1,+1\}^{n \times k} \quad (4)$$

When solving for the unknown variable B in equation (4), a discrete hash algorithm based on singular value decomposition is used.
In step (3), based on the solved deep neural network parameters θ_v and θ_t of the image and text modalities and the projection matrices P_v and P_t, binary hash codes are generated for the query samples and the samples in the retrieval sample set. Specifically, assume the feature vector of a query sample of the image modality is v^(q), the feature vector of a query sample of the text modality is t^(q), the features of the samples in the image-modality retrieval sample set are {v_i^(r)}_{i=1}^{n_r}, and the features of the text-modality retrieval sample set are {t_i^(r)}_{i=1}^{n_r}, where n_r denotes the number of samples in the retrieval sample set. The binary hash codes of the image-modality query sample, the text-modality query sample, and the samples of the retrieval sample sets are, respectively:

$$b_v^{(q)} = \mathrm{sign}\big(P_v^{T} f(v^{(q)};\theta_v)\big),\quad b_t^{(q)} = \mathrm{sign}\big(P_t^{T} g(t^{(q)};\theta_t)\big),$$
$$b_{v,i}^{(r)} = \mathrm{sign}\big(P_v^{T} f(v_i^{(r)};\theta_v)\big)\ \ \text{and}\ \ b_{t,i}^{(r)} = \mathrm{sign}\big(P_t^{T} g(t_i^{(r)};\theta_t)\big),\quad i = 1,2,\ldots,n_r,$$

where sign(·) is the sign function.
In step (4), the Hamming distance from the query sample to each sample in the retrieval sample set is computed based on the generated binary hash codes. Specifically, the formula

$$d_H\big(b_v^{(q)}, b_{t,j}^{(r)}\big) = \tfrac{1}{2}\big(k - (b_v^{(q)})^{T} b_{t,j}^{(r)}\big)$$

is used to compute the Hamming distance from the query sample of the image modality to the j-th sample (j = 1, 2, ..., n_r) in the text-modality retrieval sample set, and the formula

$$d_H\big(b_t^{(q)}, b_{v,j}^{(r)}\big) = \tfrac{1}{2}\big(k - (b_t^{(q)})^{T} b_{v,j}^{(r)}\big)$$

is used to compute the Hamming distance from the query sample of the text modality to the j-th sample in the image-modality retrieval sample set.
In step (5), the retrieval of the query sample is completed using a cross-modal retriever based on approximate nearest neighbor search. Specifically, the computed Hamming distances d_H(b_v^(q), b_{t,j}^(r)) (or d_H(b_t^(q), b_{v,j}^(r))), j = 1, 2, ..., n_r, are sorted in ascending order, and the samples corresponding to the first K smallest distances in the text-modality (or image-modality) retrieval sample set are taken as the retrieval result.
Advantageous effects
Compared with the prior art, the invention has the following advantages.
1. Under the condition of maintaining retrieval speed, the method can use the deep learning structure to mine more discriminative information, and thus complete cross-modal retrieval more accurately.
2. By enforcing intra-modal and inter-modal consistency preservation strategies, the method carries the beneficial information of the original feature space over to Hamming space, promoting the mining of discriminative information and the retrieval performance.
3. The discrete hash algorithm based on singular value decomposition proposed by the method gives the obtained binary hash codes more favorable properties, further improving the performance of cross-modal hash retrieval.
Drawings
Fig. 1 is a flowchart of a cross-modal hash retrieval method based on deep learning according to the present invention.
Detailed Description
The technical solution of the present invention is further described in detail below with reference to the accompanying drawings.
The invention discloses a cross-modal hash retrieval method based on deep learning whose specific implementation, shown in figure 1, mainly comprises the following steps. Assume the set of pixel feature vectors of the image modality of n objects is V = {v_i}_{i=1}^n, where v_i denotes the pixel feature vector of the i-th object in the image modality; let T = {t_i}_{i=1}^n denote the feature vectors of the n objects in the text modality, where t_i denotes the feature vector of the i-th object in the text modality; denote the class label vectors of the n objects as Y = [y_1, y_2, ..., y_n] ∈ {0,1}^{c×n}, where c denotes the number of object categories; for a vector y_i, if the i-th object belongs to the k-th class, the k-th element of y_i is set to 1, and otherwise it is set to 0.
(1) Construction of the deep-learning-based cross-modal hash retrieval objective function
The method aims to learn hash functions for the image modality and the text modality using the feature data V and T of the two modalities together with the class label information of the objects, and to use the learned hash functions to generate the binary hash codes with which the cross-modal hash retrieval task is completed. Using the feature data V and T of the image and text modalities directly for cross-modal hash learning is not conducive to mining discriminative information from the original features and thus to generating binary hash codes with good properties. To better mine discriminative information from the original feature data of the image and text modalities, the method constructs a Deep Neural Network (DNN) for each of the two modalities to perform deep feature learning.
For the image modality, the method adopts a seven-layer Convolutional Neural Network (CNN) adapted from AlexNet to perform deep feature learning. This CNN model is described in detail below.
The CNN model for image-modality deep feature learning contains five convolutional layers and two fully connected layers, denoted "Conv1-Conv5" and "Fc6-Fc7", respectively. The network takes the pixel features of the image modality as input. The first convolutional layer Conv1 filters the 227 × 227 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels. After activation by the Rectified Linear Unit (ReLU), max-pooling, and Local Response Normalization (LRN), an output feature map of size 27 × 27 × 96 is obtained. The second convolutional layer Conv2 takes the output of Conv1 as input and filters it with 256 kernels of size 5 × 5 × 96. Similarly, after ReLU, max-pooling, and LRN, an output feature map of size 13 × 13 × 256 is obtained. The third, fourth, and fifth convolutional layers Conv3, Conv4, and Conv5 use 384, 384, and 256 convolution kernels of sizes 3 × 3 × 256, 3 × 3 × 384, and 3 × 3 × 384, respectively, and each layer is activated with ReLU. After Conv5 is max-pooled, output features of size 6 × 6 × 256 are obtained. The fully connected layer Fc6 has 4096 neurons, and dropout with a ratio of 0.5 is used to prevent overfitting. The Fc7 layer is a fully connected layer containing d^(v) neurons, and the hyperbolic tangent (TanH) function is used as its activation function. Finally, an output feature of size d^(v) × 1 is obtained at the Fc7 layer.
For the text modality, the method constructs a Multi-Layer Perceptron (MLP) deep neural network composed of three fully connected layers, which maps the features of the text modality from the original feature space into the semantic space. The three fully connected layers are denoted Fc1, Fc2, and Fc3, respectively. Similar to MLP constructions used for text-modality feature learning in the related literature, the Fc1 and Fc2 layers use ReLU as the nonlinear activation function, and the hyperbolic tangent (TanH) function is used as the activation function of the Fc3 layer. The number of neurons in the Fc3 layer is d^(t), that is, the dimension of the output features of the MLP deep neural network for text-modality deep feature learning is d^(t).
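The following PyTorch sketch approximates the two branches described above; layer widths not stated in the text (for instance the hidden width of the text MLP) are assumptions, and in practice Conv1-Conv5 would be initialized from a pre-trained AlexNet:

```python
import torch.nn as nn

class ImageCNN(nn.Module):
    """AlexNet-style branch: Conv1-Conv5 plus Fc6-Fc7 with a TanH output."""
    def __init__(self, dv=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(),
            nn.MaxPool2d(3, 2), nn.LocalResponseNorm(5),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, 2), nn.LocalResponseNorm(5),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, 2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),  # Fc6
            nn.Linear(4096, dv), nn.Tanh(),                            # Fc7
        )

    def forward(self, x):             # x: (batch, 3, 227, 227) pixel input
        return self.fc(self.features(x))

class TextMLP(nn.Module):
    """Three fully connected layers Fc1-Fc3: ReLU, ReLU, then TanH."""
    def __init__(self, d_in, dt=1024, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, hidden), nn.ReLU(),     # Fc1
            nn.Linear(hidden, hidden), nn.ReLU(),   # Fc2
            nn.Linear(hidden, dt), nn.Tanh(),       # Fc3
        )

    def forward(self, t):             # t: (batch, d_in) text features
        return self.net(t)
```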
For the i-th object, let f(v_i;θ_v) ∈ R^{d^(v)} denote the output feature of the CNN of the image modality, where θ_v denotes the parameters of the CNN of the image modality; let g(t_i;θ_t) ∈ R^{d^(t)} denote the output feature of the MLP deep neural network of the text modality, where θ_t denotes the parameters of the MLP deep neural network of the text modality.

Suppose the deep features f(v_i;θ_v) and g(t_i;θ_t) of the i-th object in the image modality and the text modality are projected by the linear projection matrices P_v ∈ R^{d^(v)×k} and P_t ∈ R^{d^(t)×k}, so that the projected features are P_v^T f(v_i;θ_v) and P_t^T g(t_i;θ_t), respectively, where (·)^T denotes the transpose of a matrix. Suppose further that P_v^T f(v_i;θ_v) and P_t^T g(t_i;θ_t) can generate the binary hash codes b_i^(v) ∈ {-1,+1}^k and b_i^(t) ∈ {-1,+1}^k in Hamming space, respectively. Cross-modal hash learning can then be performed through the following minimization problem:

$$\min\ \sum_{i=1}^{n}\Big(\big\|b_i^{(v)} - P_v^{T} f(v_i;\theta_v)\big\|_2^2 + \big\|b_i^{(t)} - P_t^{T} g(t_i;\theta_t)\big\|_2^2\Big) + \gamma_1\big(\|P_v\|_F^2 + \|P_t\|_F^2\big) + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big) \quad (5)$$

where γ_1 and γ_2 are non-negative balance factors, the third term on the right-hand side is a regularization term to prevent overfitting, and the fourth term encourages the probability of each hash bit being +1 or -1 to be equal, so as to maximize the information provided by each bit of the hash code.
Intra-modal similarity reflects the neighborhood relationships between data points composed of feature vectors within each modality. The intra-modal similarity between two data points v_i and v_j of the image modality can be defined as:

$$w_{ij}^{(v)} = \begin{cases} \exp\!\big(-\|v_i - v_j\|_2^2 / \sigma^2\big), & v_i \in N_{k_1}(v_j)\ \text{or}\ v_j \in N_{k_1}(v_i) \\ 0, & \text{otherwise} \end{cases} \quad (6)$$

where N_{k_1}(v_i) denotes the set of k_1 nearest neighbors (k_1-nearest neighbors) of the data point v_i, ||v_i - v_j||_2 denotes the Euclidean distance between v_i and v_j, ||·||_2 denotes the vector l_2 norm, and σ controls the decay rate of w_{ij}^(v). Similarly, the intra-modal similarity w_{ij}^(t) between two data points t_i and t_j composed of feature vectors in the text modality is defined as:

$$w_{ij}^{(t)} = \begin{cases} \exp\!\big(-\|t_i - t_j\|_2^2 / \sigma^2\big), & t_i \in N_{k_1}(t_j)\ \text{or}\ t_j \in N_{k_1}(t_i) \\ 0, & \text{otherwise.} \end{cases}$$
for each modality, in order to keep the local neighbor structure of the data points consistent in hamming space and original feature space, namely: each data point and its neighbor relation in the original feature space are maintained in hamming space, and the following objective function can be designed:
Figure BDA0001994609040000121
based on the class label information of the object, a data point v of the image modality can be definedi(i=1,2,…N) data point t of text modej(j ═ 1,2, …, n) of the semantic correlation matrix shown below:
Figure BDA0001994609040000122
it should be noted that: as long as viAnd tjThey are considered to have the same semantics if they belong to at least one of the same category. In order to maintain inter-modal consistency between image and text modalities in hamming space, the following objective function can be designed:
Figure BDA0001994609040000123
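A numpy sketch of these similarity constructions follows; the combination W = W^(v) + W^(t) + S used to form the joint Laplacian is an assumption consistent with the derivation below:

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_gaussian_similarity(X, k1=10, sigma=1.0):
    """Intra-modal similarity: Gaussian kernel on a symmetrized k1-NN graph.
    X: (n, d) features of one modality (assumes no duplicated rows)."""
    d2 = cdist(X, X, "sqeuclidean")
    nn_idx = np.argsort(d2, axis=1)[:, 1:k1 + 1]       # skip the point itself
    mask = np.zeros_like(d2, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), k1)
    mask[rows, nn_idx.ravel()] = True
    mask |= mask.T                    # v_i in N(v_j) or v_j in N(v_i)
    return np.where(mask, np.exp(-d2 / sigma ** 2), 0.0)

def semantic_similarity(Y):
    """s_ij = 1 iff objects i and j share at least one category.
    Y: (n, c) label matrix with 0/1 entries."""
    return ((Y @ Y.T) > 0).astype(float)

def joint_laplacian(Wv, Wt, S):
    """L = D - W with W combining intra- and inter-modal similarities."""
    W = Wv + Wt + S
    return np.diag(W.sum(axis=1)) - W
```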
combining the above analysis on image modal depth feature learning, text modal depth feature learning, intra-modal consistency and inter-modal consistency preservation, the objective function of the method of the invention can be designed as:
Figure BDA0001994609040000124
according to prior work, if data in different modality spaces have the same semantics, the data in these different modalities often correspond to a common underlying space. Thus, the present invention assumes that features with the same semantics in both the image modality and the text modality can ultimately be represented as the same binary hash encoding in a common hamming space. That is to say that there are
Figure BDA0001994609040000125
This is true. Based on this assumption, the optimization problem in equation (7) can be expressed as:
Figure BDA0001994609040000131
wherein the content of the first and second substances,
Figure BDA0001994609040000132
Figure BDA0001994609040000133
Figure BDA0001994609040000134
by means of a simple derivation,
Figure BDA0001994609040000135
can be rewritten as follows:
Figure BDA0001994609040000136
wherein, B ═ B1,b2,…,bn]T∈{-1,+1}n×k
Figure BDA0001994609040000137
And the vectors of the ith columns of the matrix F and the matrix G are respectively F (v)i;θv) And g (t)i;θt),||·||FThe Frobenius norm of the matrix is represented. To pair
Figure BDA0001994609040000138
Its equivalent can be derived by following the derivation:
Figure BDA0001994609040000139
wherein the content of the first and second substances,
Figure BDA00019946090400001310
L-D-W is a laplacian matrix,
Figure BDA00019946090400001311
Figure BDA00019946090400001312
denotes the ith diagonal element, w, of the diagonal matrix DijFor an element in row ith and column jth of matrix W, tr (-) represents a trace of the matrix. From equations (12) and (13), equation (8) can be rewritten as:
Figure BDA00019946090400001313
(2) Solving the objective function
The objective function in equation (14) contains five unknown variables to be solved: the binary hash code matrix B, the linear projection matrices P_v and P_t, and the deep neural network parameters θ_v and θ_t. Equation (14) is non-convex in these five variables jointly, so an analytical solution for all five cannot be obtained simultaneously. The unknown variables in equation (14) can instead be obtained by alternately solving three sub-problems: fix B, P_v and P_t and solve for θ_v and θ_t; fix B, θ_v and θ_t and solve for P_v and P_t; fix θ_v, θ_t, P_v and P_t and solve for B.
(a) Fix B, P_v and P_t, and solve for θ_v and θ_t
When the binary hash codes B and the projection matrices P_v and P_t are fixed, the objective function in equation (14) reduces to a sub-problem with respect to the deep neural network parameters θ_v and θ_t:

$$\min_{\theta_v,\theta_t}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big) \quad (15)$$
the invention uses a Back Propagation (BP) algorithm to learn and update DNN network parameters thetav. Similar to most existing deep learning methods, a random gradient descent algorithm based on back propagation is used here to learn θv. Learning thetavThe specific method comprises the following steps: selecting a small batch of training samples from the training samples at each iteration, and then learning theta by using a random gradient descent algorithm based on backward propagation by using the selected samplesv. Each feature vector v of the image modality for a selected training sampleiFirst, the gradient is calculated using the following formula:
Figure BDA0001994609040000142
then, using the chain rule and the obtained
Figure BDA0001994609040000143
Calculating out
Figure BDA0001994609040000144
Finally, using the calculation
Figure BDA0001994609040000145
Updating DNN network parameter theta of image modality by BP algorithmv
Algorithm 1 shows the algorithm for solving the DNN network parameters θ_v of the image modality.
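Algorithm 1 is reproduced only as an image in the source; the following PyTorch sketch shows one plausible form of the update it describes, with B and P_v held fixed and θ_v updated by back-propagation-based mini-batch SGD (the bit-balance term is approximated within each mini-batch):

```python
import torch

def theta_v_step(img_net, loader, B, Pv, gamma2, lr=1e-4):
    """One epoch of learning theta_v with B and Pv fixed (Algorithm 1 sketch).
    loader yields (images, sample indices); B: (n, k) fixed codes."""
    opt = torch.optim.SGD(img_net.parameters(), lr=lr, momentum=0.9,
                          weight_decay=1e-4)
    for v, idx in loader:
        zf = img_net(v) @ Pv          # projected deep features, (batch, k)
        b = B[idx]                    # fixed target codes for this batch
        loss = (((b - zf) ** 2).sum()
                + gamma2 * zf.sum(0).square().sum())   # batch bit-balance term
        opt.zero_grad()
        loss.backward()               # chain rule, as in equation (16)
        opt.step()                    # SGD update of theta_v
```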
Similarly, the deep neural network parameters θ_t of the text modality are learned and updated with the back-propagation-based stochastic gradient descent algorithm. For each feature vector t_i of the text modality in the selected training samples, first compute the gradient:

$$\frac{\partial \mathcal{J}}{\partial g(t_i;\theta_t)} = 2P_t\big(P_t^{T} g(t_i;\theta_t) - b_i\big) + 2\gamma_2 P_t P_t^{T} G\mathbf{1} \quad (17)$$

Then, using the chain rule and the obtained ∂J/∂g(t_i;θ_t), compute ∂J/∂θ_t. Finally, use the computed ∂J/∂θ_t to update the DNN network parameters θ_t of the text modality with the BP algorithm. With an algorithm analogous to Algorithm 1, the DNN network parameters θ_t of the text modality can be learned.
(b) Fix B, θ_v and θ_t, and solve for P_v and P_t
When the binary hash codes B and the deep neural network parameters θ_v and θ_t are fixed, the objective function in equation (14) reduces to a sub-problem with respect to the projection matrices P_v and P_t:

$$\min_{P_v,P_t}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \gamma_1\big(\|P_v\|_F^2 + \|P_t\|_F^2\big) + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big) \quad (18)$$

Taking the partial derivatives of the objective in equation (18) with respect to P_v and P_t, respectively, and setting them equal to 0 gives:

$$-2F\big(B - F^{T}P_v\big) + 2\gamma_1 P_v + 2\gamma_2 F\mathbf{1}\mathbf{1}^{T}F^{T}P_v = 0, \quad (19)$$
$$-2G\big(B - G^{T}P_t\big) + 2\gamma_1 P_t + 2\gamma_2 G\mathbf{1}\mathbf{1}^{T}G^{T}P_t = 0. \quad (20)$$

By a simple derivation:

$$P_v = \big(FF^{T} + \gamma_1 I + \gamma_2 F\mathbf{1}\mathbf{1}^{T}F^{T}\big)^{-1} FB, \quad (21)$$
$$P_t = \big(GG^{T} + \gamma_1 I + \gamma_2 G\mathbf{1}\mathbf{1}^{T}G^{T}\big)^{-1} GB, \quad (22)$$

where I is the identity matrix and (·)^{-1} denotes the inverse of a matrix.
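A direct numpy transcription of the closed-form updates (21)-(22) might look as follows (the gamma factors are written explicitly):

```python
import numpy as np

def p_step(F, G, B, gamma1=1.0, gamma2=1.0):
    """Closed-form projections from equations (21)-(22).
    F: (dv, n), G: (dt, n) depth features; B: (n, k) fixed codes."""
    ones = np.ones((F.shape[1], 1))
    def solve(X):
        X1 = X @ ones                                  # X 1, shape (d, 1)
        A = X @ X.T + gamma1 * np.eye(X.shape[0]) + gamma2 * (X1 @ X1.T)
        return np.linalg.solve(A, X @ B)               # A^{-1} X B
    return solve(F), solve(G)                          # Pv, Pt
```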
(c) Fix θ_v, θ_t, P_v and P_t, and solve for B
When the deep neural network parameters θ_v and θ_t and the projection matrices P_v and P_t are fixed, the objective function in equation (14) reduces to a sub-problem with respect to the binary hash codes B:

$$\min_{B}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \mathrm{tr}(B^{T}LB),\quad \mathrm{s.t.}\ B \in \{-1,+1\}^{n \times k} \quad (23)$$

A simple expansion of equation (23) yields:

$$\min_{B}\ \mathrm{tr}(B^{T}LB) - 2\,\mathrm{tr}\big(B^{T}(F^{T}P_v + G^{T}P_t)\big) + 2\|B\|_F^2 + \|F^{T}P_v\|_F^2 + \|G^{T}P_t\|_F^2 \quad (24)$$

Because P_v, P_t, θ_v and θ_t are fixed, ||F^T P_v||_F^2 and ||G^T P_t||_F^2 are both constants, and ignoring these two terms in equation (24) does not affect the solution of B. Furthermore, since B ∈ {-1,+1}^{n×k}, we have ||B||_F^2 = nk, that is, ||B||_F^2 is also a constant. After the constant terms in equation (24) are discarded, equation (24) is transformed into:

$$\min_{B}\ \mathrm{tr}(B^{T}LB) - 2\,\mathrm{tr}(B^{T}Q),\quad \mathrm{s.t.}\ B \in \{-1,+1\}^{n \times k} \quad (25)$$

where Q = F^T P_v + G^T P_t.
since the unknown variable in the formula (25) is a discrete variable, it is generally difficult to directly solve the discrete variable to obtain an analytical solution. The invention proposes a discrete hash algorithm based on singular value decomposition to solve the optimization problem on the discrete variable B shown in the formula (25). The discrete hash algorithm based on singular value decomposition is described in detail below.
Performing singular value decomposition on the matrix L gives L = UΣV^T, where U ∈ R^{n×n} and V ∈ R^{n×n} are orthogonal matrices and Σ is a diagonal matrix. Substituting L = UΣV^T into equation (25) yields:

$$\min_{B}\ \mathrm{tr}\big(B^{T}U\Sigma V^{T}B\big) - 2\,\mathrm{tr}(B^{T}Q),\quad \mathrm{s.t.}\ B \in \{-1,+1\}^{n \times k} \quad (26)$$

Let Ũ = UΣ, and let b^i, ũ^i and v^i denote the i-th rows of the matrices B, Ũ and V, respectively; let B_{-i}, Ũ_{-i} and V_{-i} denote the matrices formed by the remaining rows of B, Ũ and V after removing b^i, ũ^i and v^i. One then obtains:

$$\mathrm{tr}\big(B^{T}U\Sigma V^{T}B\big) = \big(\tilde{u}^{i}V_{-i}^{T} + v^{i}\tilde{U}_{-i}^{T}\big)B_{-i}\big(b^{i}\big)^{T} + \mathrm{const} \quad (27)$$

Similarly, one obtains:

$$\mathrm{tr}(B^{T}Q) = q^{i}\big(b^{i}\big)^{T} + \mathrm{const} \quad (28)$$

where q^i denotes the i-th row of the matrix Q, and each constant collects the terms formed by the remaining rows. According to equations (27) and (28), the unknown binary hash code matrix B in equation (26) can be obtained by solving the following optimization problem for each b^i (i = 1, 2, ..., n):

$$\min_{b^{i} \in \{-1,+1\}^{1 \times k}}\ \big(\tilde{u}^{i}V_{-i}^{T} + v^{i}\tilde{U}_{-i}^{T}\big)B_{-i}\big(b^{i}\big)^{T} - 2q^{i}\big(b^{i}\big)^{T} \quad (29)$$

By a simple derivation, equation (29) can be converted into:

$$\max_{b^{i} \in \{-1,+1\}^{1 \times k}}\ \Big(2q^{i} - \big(\tilde{u}^{i}V_{-i}^{T} + v^{i}\tilde{U}_{-i}^{T}\big)B_{-i}\Big)\big(b^{i}\big)^{T} \quad (30)$$

The optimization problem in equation (30) has the following analytical solution:

$$b^{i} = \mathrm{sign}\Big(2q^{i} - \big(\tilde{u}^{i}V_{-i}^{T} + v^{i}\tilde{U}_{-i}^{T}\big)B_{-i}\Big) \quad (31)$$

where sign(·) denotes the sign function. The rows of B are updated in this way one by one until convergence.
Algorithm 2 shows a discrete hash algorithm based on singular value decomposition.
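Algorithm 2 is likewise reproduced only as an image in the source; the following numpy sketch implements the row-wise update derived in equations (26)-(31), with the number of sweeps over the rows as an assumed stopping rule:

```python
import numpy as np

def b_step(L, Q, B0, n_sweeps=3):
    """Cyclic row updates for min tr(B^T L B) - 2 tr(B^T Q), B in {-1,+1}^(n x k).
    L: (n, n) joint Laplacian; Q = F^T Pv + G^T Pt: (n, k); B0: initial codes."""
    U, s, Vt = np.linalg.svd(L)       # L = U diag(s) V^T
    Ut = U * s                        # rows are the vectors u~^i of U Sigma
    V = Vt.T                          # rows are the vectors v^i
    B = B0.copy()
    n = B.shape[0]
    for _ in range(n_sweeps):
        for i in range(n):
            keep = np.arange(n) != i
            lin = Ut[i] @ V[keep].T @ B[keep] + V[i] @ Ut[keep].T @ B[keep]
            B[i] = np.where(2 * Q[i] - lin >= 0, 1, -1)   # equation (31)
    return B
```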
(3) Generating binary hash codes for the query samples and the samples in the retrieval sample set
Assume the feature vector of a query sample of the image modality is v^(q), the feature vector of a query sample of the text modality is t^(q), the features of the samples in the image-modality retrieval sample set are {v_i^(r)}_{i=1}^{n_r}, and the features of the text-modality retrieval sample set are {t_i^(r)}_{i=1}^{n_r}, where n_r denotes the number of samples in the retrieval sample set. With the solved projection matrices P_v and P_t of the image and text modalities and the deep neural network parameters θ_v and θ_t of the image and text modalities, the binary hash codes of the query samples and of the retrieval sample sets in the two modalities are obtained as:

$$b_v^{(q)} = \mathrm{sign}\big(P_v^{T} f(v^{(q)};\theta_v)\big),\quad b_t^{(q)} = \mathrm{sign}\big(P_t^{T} g(t^{(q)};\theta_t)\big),$$
$$b_{v,i}^{(r)} = \mathrm{sign}\big(P_v^{T} f(v_i^{(r)};\theta_v)\big)\ \ \text{and}\ \ b_{t,i}^{(r)} = \mathrm{sign}\big(P_t^{T} g(t_i^{(r)};\theta_t)\big),\quad i = 1,2,\ldots,n_r,$$

where sign(·) is the sign function.
(4) Computing the Hamming distance from the query sample to each sample in the retrieval sample set
For a query sample v^(q) of the image modality, the formula

$$d_H\big(b_v^{(q)}, b_{t,j}^{(r)}\big) = \tfrac{1}{2}\big(k - (b_v^{(q)})^{T} b_{t,j}^{(r)}\big),\quad j = 1,2,\ldots,n_r,$$

is used to compute the Hamming distances from the query sample of the image modality to the samples of the text-modality retrieval sample set. For a query sample t^(q) of the text modality, the formula

$$d_H\big(b_t^{(q)}, b_{v,j}^{(r)}\big) = \tfrac{1}{2}\big(k - (b_t^{(q)})^{T} b_{v,j}^{(r)}\big),\quad j = 1,2,\ldots,n_r,$$

is used to compute the Hamming distances from the query sample of the text modality to the samples of the image-modality retrieval sample set.
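A numpy sketch of the encoding and distance computations in steps (3) and (4) (array names are illustrative):

```python
import numpy as np

def encode(feats, P):
    """Binary codes sign(P^T f): feats (m, d) deep features, P (d, k)."""
    return np.where(feats @ P >= 0, 1, -1)

def hamming_to_set(bq, Br):
    """Hamming distances from a query code bq (k,) to a code matrix Br (m, k),
    using the {-1,+1} identity d_H = (k - <bq, b>) / 2."""
    return (bq.shape[0] - Br @ bq) / 2
```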
(5) Completing the retrieval of the query sample using a cross-modal retriever
For the task of retrieving texts with an image, the n_r computed Hamming distances d_H(b_v^(q), b_{t,j}^(r)) are first sorted in ascending order, and the samples corresponding to the first K smallest distances in the text retrieval sample set are then taken as the retrieval result. Similarly, for the task of retrieving images with a text, the n_r computed Hamming distances d_H(b_t^(q), b_{v,j}^(r)) are first sorted in ascending order, and the samples corresponding to the first K smallest distances in the image retrieval sample set are then taken as the retrieval result.
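Using the identity above, this ranking step reduces to an argsort over the distances; the snippet below uses random stand-in codes purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
B_txt = np.where(rng.standard_normal((1000, 32)) >= 0, 1, -1)  # retrieval codes
bq = np.where(rng.standard_normal(32) >= 0, 1, -1)             # query code

dists = (bq.shape[0] - B_txt @ bq) / 2           # Hamming distances, (1000,)
K = 50
result = np.argsort(dists, kind="stable")[:K]    # indices of the top-K samples
```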
The following describes the advantageous effects of the present invention with reference to specific experiments.
The experiments on the method of the invention were mainly carried out on the Pascal VOC 2007 dataset, which is first briefly described. The Pascal VOC 2007 dataset contains 9963 images belonging to 20 categories (e.g., airplane, bottle, horse, sofa), and each image is labeled. In the experiments, the dataset was partitioned into a training set containing 5011 image-tag pairs and a test set containing 4952 image-tag pairs. For the deep cross-modal hashing methods, the image modality uses the raw pixel features as input. For the methods that take hand-crafted features as input, 512-dimensional GIST features are used as input features. For the text modality, 399-dimensional word-frequency features are used as input features. Two main cross-modal retrieval tasks are performed in the experiments: retrieving texts with images and retrieving images with texts, denoted Img2Txt and Txt2Img, respectively.
The performance of cross-modal hash retrieval is measured by the mean Average Precision (MAP). To obtain the MAP, the Average Precision (AP) is first computed for each query sample; after the APs of all query samples have been obtained, the MAP is their mean.
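A numpy sketch of this evaluation protocol (the ranking and relevance arrays are assumed to come from the retrieval step above):

```python
import numpy as np

def average_precision(relevant, ranking):
    """AP of one query. ranking: retrieval order (indices into the set);
    relevant: boolean array marking the relevant samples of the set."""
    rel = relevant[ranking]
    if not rel.any():
        return 0.0
    prec_at_hit = np.cumsum(rel) / (np.arange(rel.size) + 1)
    return float(prec_at_hit[rel].mean())

def mean_average_precision(relevances, rankings):
    """MAP: mean of the per-query APs."""
    return float(np.mean([average_precision(r, o)
                          for r, o in zip(relevances, rankings)]))
```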
The method of the invention uses a mini-batch gradient descent algorithm with momentum 0.9 and weight decay 0.0001, and the batch size is set to 128. The first five layers of the image-modality deep neural network are initialized with AlexNet pre-trained on the ImageNet dataset; the other parameters of the deep neural networks are initialized randomly. The output feature dimensions of the deep neural networks of the image modality and the text modality are both set to 1024. In the experiments, 5-fold cross validation is used to determine the optimal values of the parameters γ_1 and γ_2 of the method. The parameters of the other methods are set according to the principles recommended by each method, and the reported results are the averages of 10 random runs.
The methods compared with the method of the invention are: Semantic Correlation Maximization (SCM), Supervised Matrix Factorization Hashing (SMFH), Deep Cross-Modal Hashing (DCMH), and Pairwise Relationship-oriented Deep Hashing (PRDH). Table 1 lists the MAP of the method of the invention and the comparison methods for cross-modal hash retrieval on the Pascal VOC 2007 dataset. As can be seen from Table 1, for both retrieval tasks and under three hash code lengths, the deep cross-modal hashing methods DCMH, PRDH, and the method of the invention all achieve better retrieval performance than the shallow methods SCM and SMFH, which shows that learning the deep features used to generate the binary hash codes with deep learning techniques is beneficial. Table 1 also shows that, for the Img2Txt and Txt2Img tasks, the cross-modal retrieval performance of the method of the invention is superior to that of DCMH and PRDH under all three hash code lengths. This indicates that the method of the invention is an effective cross-modal hash retrieval method.
TABLE 1 MAP of methods on Pascal VOC 2007 dataset

Claims (5)

1. A cross-modal hash retrieval method based on deep learning, wherein the set of pixel feature vectors of the image modality of n objects is assumed to be V = {v_i}_{i=1}^n, where v_i denotes the pixel feature vector of the i-th object in the image modality; T = {t_i}_{i=1}^n denotes the feature vectors of the n objects in the text modality, where t_i denotes the feature vector of the i-th object in the text modality; the class label vectors of the n objects are denoted Y = [y_1, y_2, ..., y_n] ∈ {0,1}^{c×n}, where c denotes the number of object categories; for a vector y_i, if the i-th object belongs to the k-th class, the k-th element of y_i is set to 1, and otherwise it is set to 0; the method is characterized by comprising the following steps:
(1) obtaining the binary hash codes B shared by the image modality and the text modality by using an objective function designed on the basis of deep learning techniques, and obtaining the deep neural network parameters θ_v and θ_t of the image modality and the text modality and the projection matrices P_v and P_t of the image modality and the text modality;
(2) solving the unknown variables B, θ_v, θ_t, P_v and P_t in the objective function in an alternating manner, i.e., alternately solving the following three sub-problems: fixing B, P_v and P_t and solving for θ_v and θ_t; fixing B, θ_v and θ_t and solving for P_v and P_t; fixing θ_v, θ_t, P_v and P_t and solving for B;
(3) generating binary hash codes for the query samples and the samples in the retrieval sample set based on the solved deep neural network parameters θ_v and θ_t of the image modality and the text modality and the projection matrices P_v and P_t;
(4) calculating the Hamming distance from the query sample to each sample in the retrieval sample set based on the generated binary hash codes;
(5) completing the retrieval of the query sample using a cross-modal retriever based on approximate nearest neighbor search;
the objective function designed on the basis of deep learning techniques in step (1) has the following form:

$$\min_{B,\theta_v,\theta_t,P_v,P_t}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \mathrm{tr}(B^{T}LB) + \gamma_1\big(\|P_v\|_F^2 + \|P_t\|_F^2\big) + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big),\quad \mathrm{s.t.}\ B \in \{-1,+1\}^{n \times k} \quad (1)$$

where γ_1 and γ_2 are non-negative balance factors, B = [b_1, b_2, ..., b_n]^T ∈ {-1,+1}^{n×k}, P_v ∈ R^{d^(v)×k} and P_t ∈ R^{d^(t)×k} are the projection matrices, θ_v and θ_t are the deep neural network parameters, F = [f(v_1;θ_v), ..., f(v_n;θ_v)] and G = [g(t_1;θ_t), ..., g(t_n;θ_t)] are the depth features of the n objects in the image modality and the text modality, respectively (the i-th columns of the matrices F and G are f(v_i;θ_v) and g(t_i;θ_t)), L is a Laplacian matrix for preserving intra-modal consistency and inter-modal consistency, 1 is a column vector whose elements are all 1, ||·||_F denotes the Frobenius norm of a matrix, tr(·) denotes the trace of a matrix, and (·)^T denotes the transpose of a matrix.
2. The deep-learning-based cross-modal hash retrieval method according to claim 1, wherein step (2) solves the unknown variables B, θ_v, θ_t, P_v and P_t in the objective function in an alternating manner; specifically, the following three sub-problems are solved alternately:

(1) fixing B, P_v and P_t and solving for θ_v and θ_t: when the binary hash codes B and the projection matrices P_v and P_t are fixed, the objective function in equation (1) reduces to the sub-problem with respect to the deep neural network parameters θ_v and θ_t:

$$\min_{\theta_v,\theta_t}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big) \quad (2)$$

(2) fixing B, θ_v and θ_t and solving for P_v and P_t: when the binary hash codes B and the deep neural network parameters θ_v and θ_t are fixed, the objective function in equation (1) reduces to the sub-problem with respect to the projection matrices P_v and P_t:

$$\min_{P_v,P_t}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \gamma_1\big(\|P_v\|_F^2 + \|P_t\|_F^2\big) + \gamma_2\big(\|P_v^{T}F\mathbf{1}\|_F^2 + \|P_t^{T}G\mathbf{1}\|_F^2\big) \quad (3)$$

(3) fixing θ_v, θ_t, P_v and P_t and solving for B: when the deep neural network parameters θ_v and θ_t and the projection matrices P_v and P_t are fixed, the objective function in equation (1) reduces to the sub-problem with respect to the binary hash codes B:

$$\min_{B}\ \|B - F^{T}P_v\|_F^2 + \|B - G^{T}P_t\|_F^2 + \mathrm{tr}(B^{T}LB),\quad \mathrm{s.t.}\ B \in \{-1,+1\}^{n \times k} \quad (4)$$

when solving for the unknown variable B in equation (4), a discrete hash algorithm based on singular value decomposition is used.
3. The deep-learning-based cross-modal hash retrieval method according to claim 1, wherein in step (3), based on the solved deep neural network parameters θ_v and θ_t of the image modality and the text modality and the projection matrices P_v and P_t, binary hash codes are generated for the query samples and the samples in the retrieval sample set; specifically, assume the feature vector of a query sample of the image modality is v^(q), the feature vector of a query sample of the text modality is t^(q), the features of the samples in the image-modality retrieval sample set are {v_i^(r)}_{i=1}^{n_r}, and the features of the text-modality retrieval sample set are {t_i^(r)}_{i=1}^{n_r}, where n_r denotes the number of samples in the retrieval sample set; the binary hash codes of the image-modality query sample, the text-modality query sample, and the samples of the retrieval sample sets are, respectively:

$$b_v^{(q)} = \mathrm{sign}\big(P_v^{T} f(v^{(q)};\theta_v)\big),\quad b_t^{(q)} = \mathrm{sign}\big(P_t^{T} g(t^{(q)};\theta_t)\big),$$
$$b_{v,i}^{(r)} = \mathrm{sign}\big(P_v^{T} f(v_i^{(r)};\theta_v)\big)\ \ \text{and}\ \ b_{t,i}^{(r)} = \mathrm{sign}\big(P_t^{T} g(t_i^{(r)};\theta_t)\big),\quad i = 1,2,\ldots,n_r,$$

where sign(·) is the sign function.
4. The deep-learning-based cross-modal hash retrieval method according to claim 1, wherein step (4) calculates the Hamming distance from the query sample to each sample in the retrieval sample set based on the generated binary hash codes; specifically, the formula

$$d_H\big(b_v^{(q)}, b_{t,j}^{(r)}\big) = \tfrac{1}{2}\big(k - (b_v^{(q)})^{T} b_{t,j}^{(r)}\big)$$

is used to compute the Hamming distance from the image-modality query sample to the j-th sample (j = 1, 2, ..., n_r) in the text-modality retrieval sample set, and the formula

$$d_H\big(b_t^{(q)}, b_{v,j}^{(r)}\big) = \tfrac{1}{2}\big(k - (b_t^{(q)})^{T} b_{v,j}^{(r)}\big)$$

is used to compute the Hamming distance from the text-modality query sample to the j-th sample in the image-modality retrieval sample set.
5. The deep-learning-based cross-modal hash retrieval method according to claim 1, wherein step (5) uses a cross-modal retriever based on approximate nearest neighbor search to complete the retrieval of the query sample; specifically, the computed Hamming distances d_H(b_v^(q), b_{t,j}^(r)) or d_H(b_t^(q), b_{v,j}^(r)), j = 1, 2, ..., n_r, are sorted in ascending order, and the samples corresponding to the first K smallest distances in the text-modality or image-modality retrieval sample set are taken as the retrieval result.
CN201910196009.7A 2019-03-14 2019-03-14 Cross-modal Hash retrieval method based on deep learning Active CN110019652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910196009.7A CN110019652B (en) 2019-03-14 2019-03-14 Cross-modal Hash retrieval method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910196009.7A CN110019652B (en) 2019-03-14 2019-03-14 Cross-modal Hash retrieval method based on deep learning

Publications (2)

Publication Number Publication Date
CN110019652A CN110019652A (en) 2019-07-16
CN110019652B (en) 2022-06-03

Family ID: 67189652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910196009.7A Active CN110019652B (en) 2019-03-14 2019-03-14 Cross-modal Hash retrieval method based on deep learning

Country Status (1)

Country Link
CN (1) CN110019652B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674323B (en) * 2019-09-02 2020-06-30 山东师范大学 Unsupervised cross-modal Hash retrieval method and system based on virtual label regression
US11651037B2 (en) * 2019-12-20 2023-05-16 Rakuten Group, Inc. Efficient cross-modal retrieval via deep binary hashing and quantization
US11899765B2 (en) 2019-12-23 2024-02-13 Dts Inc. Dual-factor identification system and method with adaptive enrollment
CN111628866B (en) * 2020-05-22 2021-08-31 深圳前海微众银行股份有限公司 Neural network verification method, device and equipment and readable storage medium
CN111639197B (en) * 2020-05-28 2021-03-12 山东大学 Cross-modal multimedia data retrieval method and system with label embedded online hash
CN112199375B (en) * 2020-09-30 2024-03-01 三维通信股份有限公司 Cross-modal data processing method and device, storage medium and electronic device
CN113407661B (en) * 2021-08-18 2021-11-26 鲁东大学 Discrete hash retrieval method based on robust matrix decomposition

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6188777B1 (en) * 1997-08-01 2001-02-13 Interval Research Corporation Method and apparatus for personnel detection and tracking
US10146991B2 (en) * 2015-06-11 2018-12-04 Duke University Systems and methods for large scale face identification and verification
CN107402993B (en) * 2017-07-17 2018-09-11 山东师范大学 The cross-module state search method for maximizing Hash is associated with based on identification
CN108108657B (en) * 2017-11-16 2020-10-30 浙江工业大学 Method for correcting locality sensitive Hash vehicle retrieval based on multitask deep learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184303A (en) * 2015-04-23 2015-12-23 南京邮电大学 Image marking method based on multi-mode deep learning
CN107885764A (en) * 2017-09-21 2018-04-06 银江股份有限公司 Based on the quick Hash vehicle retrieval method of multitask deep learning
CN108170755A (en) * 2017-12-22 2018-06-15 西安电子科技大学 Cross-module state Hash search method based on triple depth network
CN108334574A (en) * 2018-01-23 2018-07-27 南京邮电大学 A kind of cross-module state search method decomposed based on Harmonious Matrix
CN109271486A (en) * 2018-09-19 2019-01-25 九江学院 A kind of similitude reservation cross-module state Hash search method
CN109446347A (en) * 2018-10-29 2019-03-08 山东师范大学 A kind of multi-modal Hash search method of fast discrete and system having supervision
CN109299342A (en) * 2018-11-30 2019-02-01 武汉大学 A kind of cross-module state search method based on circulation production confrontation network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Continuum regression for cross-modal multimedia retrieval; Yongming Chen et al.; 2012 19th IEEE International Conference on Image Processing; 2013-02-21; 1949-1952 *
Cross-Modal Subspace Learning via Pairwise Constraints; Ran He et al.; IEEE Transactions on Image Processing; 2015-08-07; vol. 24; 5543-5556 *
Semi-supervised multi-view image classification based on local manifold reconstruction (in Chinese); 董西伟; Computer Engineering and Applications; 2016-09-30; vol. 52, no. 18; 24-30 *
Research on image-text cross-modal retrieval based on deep hashing algorithms (in Chinese); 姚伟娜; China Masters' Theses Full-text Database, Information Science and Technology; 2019-01-15; no. 01; I138-5069 *
A survey of cross-modal retrieval research (in Chinese); 欧卫华 et al.; Journal of Guizhou Normal University (Natural Sciences); 2018-03-31; vol. 36, no. 02; 114-120 *

Also Published As

Publication number Publication date
CN110019652A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN110019652B (en) Cross-modal Hash retrieval method based on deep learning
Jiang et al. Asymmetric deep supervised hashing
Chen et al. Deep image retrieval: A survey
Shabbir et al. Satellite and scene image classification based on transfer learning and fine tuning of ResNet50
Xia et al. Supervised hashing for image retrieval via image representation learning
CN111783831B (en) Complex image accurate classification method based on multi-source multi-label shared subspace learning
An et al. Fast and incremental loop closure detection with deep features and proximity graphs
Yan et al. Semi-Supervised Deep Hashing with a Bipartite Graph.
CN106055573B (en) Shoe print image retrieval method and system under multi-instance learning framework
CN109271486B (en) Similarity-preserving cross-modal Hash retrieval method
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN110647907B (en) Multi-label image classification algorithm using multi-layer classification and dictionary learning
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN112132186A (en) Multi-label classification method with partial deletion and unknown class labels
CN109960732B (en) Deep discrete hash cross-modal retrieval method and system based on robust supervision
CN112597324A (en) Image hash index construction method, system and equipment based on correlation filtering
CN111008224A (en) Time sequence classification and retrieval method based on deep multitask representation learning
CN112163114B (en) Image retrieval method based on feature fusion
Menaga et al. Deep learning: a recent computing platform for multimedia information retrieval
CN115795065A (en) Multimedia data cross-modal retrieval method and system based on weighted hash code
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN116383422B (en) Non-supervision cross-modal hash retrieval method based on anchor points
Bibi et al. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval
CN116108217A (en) Fee evasion vehicle similar picture retrieval method based on depth hash coding and multitask prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant