CN107330074B - Image retrieval method based on deep learning and Hash coding - Google Patents

Image retrieval method based on deep learning and Hash coding

Info

Publication number
CN107330074B
CN107330074B CN201710525604.1A
Authority
CN
China
Prior art keywords
image
retrieval
images
binary hash
visual attribute
Prior art date
Legal status
Active
Application number
CN201710525604.1A
Other languages
Chinese (zh)
Other versions
CN107330074A (en)
Inventor
陈熙霖
刘昊淼
王瑞平
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201710525604.1A
Publication of CN107330074A
Application granted
Publication of CN107330074B
Status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a model training method based on deep learning and hash coding. Part of the labeled image data serves as training data for a network model and is expressed through a deep network as class binary hash codes, where a class binary hash code is a simulated binary hash code whose values are continuous. The resulting class binary hash codes are connected as input to one or more task layers of the deep network and trained with one or more tasks. Based on the class binary hash codes, binary hash codes are obtained that represent the training data and carry feature information usable for retrieval.

Description

Image retrieval method based on deep learning and Hash coding
Technical Field
The invention relates to the technical field of computer vision, in particular to an image retrieval method based on deep learning and Hash coding.
Background
With the development of science and technology, the world has entered the big-data era; image data resources in particular are growing rapidly, so retrieving large-scale image data to meet user needs poses new challenges to the field of image retrieval. As a result, Content-Based Image Retrieval (CBIR) is receiving more and more attention compared with conventional Text-Based Image Retrieval (TBIR).
In CBIR, how to effectively describe image features and how to perform fast similarity search have been the research focus in recent years. Owing to the superiority of deep neural networks in feature learning, and of hash codes in computation speed and storage space during retrieval, image retrieval methods using deep convolutional neural networks, hashing techniques, or a combination of the two have emerged.
For example, one image retrieval method based on deep-network features uses a trained deep convolutional network to extract features from images and performs retrieval by computing and ranking the Euclidean distances between the features of the query image and those of the database images; see the paper "Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. Neural Codes for Image Retrieval. ECCV 2014". This method has two drawbacks. On the one hand, the extracted features are high-dimensional real-valued vectors, so the storage cost and computation are high, and the method cannot keep up with the rapid growth of current web databases. On the other hand, the deep network used for feature extraction is not trained on the database data; retrieval quality depends heavily on how similar the database data is to the data used to train the network, and when the similarity is low, retrieval quality degrades accordingly.
The prior art also includes an image retrieval method based on multi-attribute queries, which trains a joint classifier over multiple visual attributes, exploiting the associations among attributes to predict the visual attributes of an image. At retrieval time, a new query is constructed from the associations between the user-given query and the known visual attributes, and retrieval proceeds by matching the visual attributes of database images against the query; see the paper "Behjat Siddiquie, Rogerio S. Feris, and Larry S. Davis. Image Ranking and Retrieval based on Multi-Attribute Queries. CVPR 2011". On the one hand, because the model is trained only on visual attribute data, it cannot be used directly for other retrieval tasks, which limits its applicability; on the other hand, when new visual attributes are added to the database, the jointly trained model cannot be extended directly to them and must be completely retrained from scratch, which limits the method's scalability.
In addition, Chinese patent publication No. CN105512273A discloses an image retrieval method based on variable-length deep hash learning, which trains a deep network with image triplets so that the network learns binary hash codes end to end, giving similar images similar codes and dissimilar images clearly different codes. Its drawbacks are, on the one hand, that only one similarity measure can be used during training, so the resulting binary hash code serves only a single retrieval task, limiting the method's scope of application; and on the other hand, that training on image triplets makes model convergence slow, lengthening training time.
Therefore, there is a need for a fast, efficient and scalable image retrieval method.
Disclosure of Invention
The invention aims to provide an image retrieval method based on deep learning and hash coding, which can overcome the defects of the prior art.
According to one aspect of the invention, a model training method based on deep learning and hash coding is provided, which comprises the following steps:
step 1), taking partly labeled image data as training data for a network model, and expressing the training data through a deep network as class binary hash codes, where a class binary hash code is a simulated binary hash code whose values are continuous;
step 2), connecting the class binary hash codes obtained in the step 1) to one or more task layers of a deep network by taking the class binary hash codes as input, and training by using one or more tasks;
and 3) obtaining a binary hash code which is used for representing the training data and has the characteristic information available for retrieval based on the class binary hash code of the step 1).
Preferably, the one or more task layers of step 2) refer to task layers that can be used as image retrieval tasks.
Preferably, the image search task is to perform image search according to semantic categories of images.
Preferably, the semantic categories for the images may be trained using a classification task or a metric learning task based on image pairs.
Preferably, the image retrieval task is to perform image retrieval with respect to visual attributes of images.
Preferably, a set of visual property classifiers can be trained for the visual properties.
Preferably, by using the network model, label prediction can be performed on image data which is not completely labeled or is not labeled, so that attribute labels of all images in the image data are complemented.
According to another aspect of the present invention, there is provided a method for deep learning and hash coding based image retrieval according to any one of the above, including:
when a semantic category retrieval task is carried out according to a query image, a binary hash code of the query image is obtained by utilizing the network model; obtaining an image with the same semantic category as the query image as a retrieval result by comparing the binary hash code of the query image with the binary hash codes of all images in an image database; or
When visual attribute information of one or more query images is used as a retrieval task, restoring corresponding visual attribute information of all images in a database by using the network model according to binary hash codes of the images to obtain the images with the visual attribute information as retrieval results; or
When the semantic category of a query image and one or more appointed visual attribute information are used as retrieval tasks, firstly, the network model is used for restoring the corresponding visual attribute information of all images in a database according to the binary hash codes of the images, and the restored visual attribute information is used for screening the database; secondly, obtaining a binary hash code of the query image by using the network model; and comparing the binary hash codes of the query image with the binary hash codes of the images in the screened image database to obtain the images which have the same semantic type as the query image and have the specified visual attributes as retrieval results.
According to another aspect of the present invention, there is provided an image retrieval system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to perform the steps of the image retrieval method described above.
According to another aspect of the present invention, there is provided a computer-readable storage medium comprising a computer program stored on the readable storage medium, wherein the program performs the steps of the image retrieval method as described above.
Compared with the prior art, the invention has the following beneficial technical effects. Relative to existing image retrieval methods that use extracted real-valued vectors as features, the image retrieval method based on deep learning and hash coding greatly reduces the storage requirements of a retrieval system and the computation needed to compare images, achieves good retrieval quality with fast model training, and better accommodates the ever-growing scale of current web databases. Moreover, the method can serve multiple different retrieval tasks, giving it broad application prospects and good extensibility.
Drawings
Fig. 1 is a schematic diagram of an overall flow framework of an image retrieval method based on deep learning and hash coding according to the present invention;
fig. 2 is a schematic diagram of the related search application of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly apparent, an image retrieval method based on deep learning and hash coding provided in an embodiment of the present invention is described below with reference to the accompanying drawings.
In the field of image retrieval, deep learning can combine low-level image features into higher-level representations, such as attribute and category features, thereby discovering distributed feature representations of image data. Hash coding is a technique with fast query capability and low memory overhead; in image retrieval, it expresses image content as a binary hash sequence and uses that sequence to represent the image's features.
After careful study, the inventors propose an image retrieval method using a deep neural network that learns, end to end, a binary hash code as the feature representation of an image. Its key idea is to train on partly labeled image data with multiple loss functions targeting different retrieval tasks, embedding different kinds of information into the binary hash code so that the final code can serve multiple different retrieval tasks.
In one embodiment of the invention, an image retrieval method based on deep learning and hash coding is provided, and the method mainly comprises data preparation, model training and image retrieval.
Fig. 1 is a schematic diagram of an overall flow chart of an image retrieval method based on deep learning and hash coding provided by the present invention, and as shown in fig. 1, the image retrieval method based on deep learning and hash coding of the present invention includes the following steps:
s10, data preparation
In order to use binary hash codes as image representations, a large amount of image data is needed to train the deep neural network model. The training phase uses partly labeled image data, where a label is a feature annotation of an image — for example, the objects it contains (cat, dog, automobile, etc.) or the shape, color, or material of those objects. These annotations may be tags the images already carry (e.g., on an image-sharing website) or may be added later through manual labeling. For ease of understanding, the following description uses both semantic categories (e.g., cat, dog, table) and visual attributes (e.g., red, round, spotted) as examples.
For semantic categories: if the semantic category of an image is uncertain or its category label is missing, the label is marked as unknown, denoted "?"; if the category label is known, the image is annotated with the category it belongs to, such as the entries labeled "great white shark" and "hot-air balloon" in the training-data chart of fig. 1. Categories may be indexed — for example, with n categories, the category of an image is indicated by a positive integer between 1 and n;
For the different visual attributes: if an image has a given visual attribute, that attribute is marked as a positive example; if it does not, the attribute is marked as a negative example; and if it is uncertain whether the image has the attribute, or no annotation for it is provided, the attribute is marked as unknown. In the training-data chart of fig. 1, for instance, "black", "striped", and "tail" are positive examples for the first-row image while "round" and "wooden" are negative examples, and the visual attributes of the second-row image are unknown. The visual attributes of each image can thus be written as a multi-dimensional vector: with m visual attributes, each image has an m-dimensional attribute label vector.
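As a minimal sketch of this three-way annotation scheme — the attribute names and the encoding (+1 positive, -1 negative, 0 unknown) are illustrative choices, not mandated by the text:

```python
import numpy as np

# Hypothetical attribute vocabulary (m = 5); any unmentioned attribute stays unknown.
ATTRS = ["black", "striped", "tail", "round", "wooden"]

def encode_attributes(positives, negatives, attrs=ATTRS):
    """Build the m-dimensional label vector for one image:
    +1 = positive example, -1 = negative example, 0 = unknown."""
    v = np.zeros(len(attrs), dtype=np.int8)  # default: unknown
    for a in positives:
        v[attrs.index(a)] = 1
    for a in negatives:
        v[attrs.index(a)] = -1
    return v

# First-row example from fig. 1: black/striped/tail positive, round/wooden negative.
row1 = encode_attributes(["black", "striped", "tail"], ["round", "wooden"])
# Second-row example: no attribute annotations at all.
row2 = encode_attributes([], [])
```

The zero entries are what later lets the loss functions skip unknown attributes instead of treating them as negatives.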
In data preparation, "partly labeled image data" means that an image can be used for model training as long as at least one of its semantic category label and its visual attribute labels is known. This approach, on the one hand, makes full use of the large amount of available image data and alleviates model overfitting; on the other hand, it broadens the method's application scenarios, enabling it to serve various retrieval tasks such as semantic category retrieval, visual attribute retrieval, and joint semantic-category-and-visual-attribute retrieval.
S20, model training
After the data preparation of step S10 is complete, the prepared training image data is passed through a series of non-linear operations in the deep network N — e.g., the convolution, pooling, and fully connected layers of a deep convolutional neural network (CNN) — to obtain a multi-dimensional real-valued image feature representation f.
The real-valued feature representation f then undergoes one more non-linear operation. Because network training requires gradient back-propagation, a differentiable activation function is used to simulate binary hash codes, rather than directly binarizing the feature representation f with a non-differentiable step function and then training. With an S-shaped activation function, each dimension of f is compressed into a bounded range: 0 to 1 with the sigmoid function, or -1 to 1 with the hyperbolic tangent function;
in another embodiment of the present invention, the feature representation f may also be binarized using a regularization term, for example, to constrain the output value as close to ± 1 as possible.
After the above operations, a class binary hash code C0 of dimension k is obtained, where the dimension k of C0 equals the code length of the final binary hash code.
The values of a true binary hash code are strictly binary, e.g., 0 and 1, or -1 and 1. A class binary hash code is an image feature code that simulates binary hash coding but takes continuous values, e.g., real numbers between 0 and 1, or between -1 and 1.
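A rough numerical sketch of this relaxation — the feature values below are invented for illustration; the deep network would produce f:

```python
import numpy as np

def sigmoid(x):
    # Differentiable surrogate for the non-differentiable step function,
    # so gradients can flow back through the coding layer during training.
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative real-valued feature f (k = 5 dimensions).
f = np.array([2.3, -1.7, 0.1, 4.0, -3.2])

c0_sigmoid = sigmoid(f)   # class binary code with values in (0, 1)
c0_tanh = np.tanh(f)      # alternative: values in (-1, 1)
```

Large-magnitude feature values land near the endpoints (close to 0/1 or ±1), which is what makes the later threshold quantization nearly lossless.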
After the class binary hash code C0 is obtained, C0 is fed as input to different task layers, each implementing an image retrieval task, and multiple tasks are used simultaneously for training. The following description takes two retrieval tasks, semantic category and visual attribute, as examples:
for semantic categories, a classification task or a metric learning task based on image pairs may be used for training. When a classification task is used, it is assumed that a task layer comprises n nodes which respectively correspond to n semantic categories. When the class label of the image is known, a loss function for the classification, for example, using a softmax loss function or a hinge loss function, can be utilized to measure the accuracy of the classification; when the image class is unknown, the corresponding sample is ignored in the classification task. Class binary hash coding C with similarity by constraining samples of the same semantic class when using metric learning tasks0Different types of sample codes have larger difference, so that class binary Hash codes suitable for retrieval tasks are learned;
for visual attributes, corresponding visual attribute information may be implicit in the binary hash code by training a set of visual attribute classifiers. For example, the task layer includes m nodes corresponding to m kinds of visual attributes. When a certain visual attribute of the image is known, predicting loss by using a weighted attribute, such as sigmoid cross entropy loss, change loss and the like, and measuring the prediction accuracy of the sample on the visual attribute; when a visual attribute label of an image is unknown, the corresponding sample is ignored on the corresponding visual attribute classifier. By using the loss of weighting and applying different weights to the positive and negative samples, the problem of prediction deviation caused by unbalanced proportion of the positive and negative samples can be relieved to a certain extent. And after the loss calculation is finished, calculating a corresponding gradient value by derivation, and updating the parameters of the network model through a back propagation algorithm. And after repeated iteration updating, finishing the training of the network model.
After the class binary hash code C0 of a database image is extracted with the deep network N, it must be quantized with a threshold to obtain the true binary hash code C. The threshold may be fixed, e.g., 0.5 or 0, or it may be learned. Meanwhile, the parameters A of the visual attribute classifiers, e.g., a matrix of size k × m, are saved for the subsequent visual attribute retrieval task.
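The quantization step itself is a one-liner; the example values here are illustrative:

```python
import numpy as np

def quantize(c0, threshold=0.5):
    """Quantize the class binary hash code C0 into the true binary hash code C.
    threshold=0.5 suits sigmoid outputs; use 0 for tanh outputs."""
    return (c0 >= threshold).astype(np.uint8)

c0 = np.array([0.91, 0.08, 0.55, 0.47])
c = quantize(c0)
```

The same threshold (fixed or learned) must later be applied to query images so that query and database codes live in the same Hamming space.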
S30, image retrieval
When a user needs to perform image retrieval, the class binary hash code C0 of the query image is first computed through the deep network N; then C0 is quantized into the binary hash code C, using the same fixed threshold as for the database images or the threshold learned in step S20, for use in various retrieval tasks. Semantic category retrieval, visual attribute retrieval, and joint semantic-category-and-visual-attribute retrieval are described below as examples.
Semantic category retrieval means that, given a query image from the user, images of the same semantic category must be retrieved from the image database. Fig. 2 is a schematic diagram of the retrieval applications of the invention; as shown in its first row, when the user supplies an image of an automobile, all images containing an automobile must be retrieved from the database. To implement this, semantic category retrieval is accomplished by comparing the binary hash code of the query image with the binary hash codes of the database images — e.g., by computing the Hamming distance between them and returning the results in order of increasing distance;
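As a sketch of this ranking step (codes and helper names are illustrative; a production system would use packed bits and popcount):

```python
import numpy as np

def hamming(a, b):
    # Number of differing bits between two binary codes.
    return int(np.count_nonzero(a != b))

def retrieve(query_code, db_codes):
    """Rank database images by Hamming distance to the query, nearest first."""
    return sorted(range(len(db_codes)),
                  key=lambda i: hamming(query_code, db_codes[i]))

q = np.array([1, 0, 1, 1], dtype=np.uint8)
db = [np.array([1, 0, 1, 0], dtype=np.uint8),   # distance 1
      np.array([1, 0, 1, 1], dtype=np.uint8),   # distance 0
      np.array([0, 1, 0, 0], dtype=np.uint8)]   # distance 4
order = retrieve(q, db)
```

Because the codes are binary, this comparison reduces to XOR plus a bit count, which is far cheaper than Euclidean distances over real-valued features.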
the visual attribute retrieval means that a user designates one or more visual attributes as a retrieval formula, and an image having the designated visual attributes needs to be retrieved from an image database. For example, as shown in the second row of FIG. 2, given the visual attributes "white" and "metallic," the user needs to retrieve all images in the image database that contain the visual attributes. In order to realize the function, the internal product of the binary hash code C of the database image and the visual attribute classifier A can be calculated, for example, the internal product of the binary vector C and the visual attribute classifier A is calculated by using a table look-up method without a large number of multiplication operations, so as to restore the visual attribute information of the image, thereby completing the visual attribute retrieval, for example, the visual attribute is sorted by using the probability of the visual attribute;
the joint retrieval of semantic categories and visual attributes means that a user designates a query image and designates one or more visual attributes, and images which have the same semantic categories as the query image and have the visual attributes designated by the user need to be retrieved in an image database. For example, as shown in the third row of FIG. 2, given an image of a car and the visual attribute "Red", the user needs to retrieve all images in the image database that contain the semantic category labeled as car and the visual attribute labeled "Red". In order to realize the function, firstly, the same method as that used in the visual attribute retrieval is used, the restored visual attributes are used for screening all database images, and for example, the images with the visual attributes lower than the threshold are removed according to a certain threshold; then, the two-value hash codes of the query image and the screened database image are compared by the same method as that used for semantic category retrieval, and retrieval is completed, for example, after the query image and the screened database image are sorted by calculating Hamming distance, the screened database image is returned as a retrieval result in the order of the distance from small to large.
In another embodiment of the invention, the network model trained in step S20 may be used to predict labels for image data that is incompletely labeled or unlabeled, so that the data's attribute labels can be completed and the data added to the database.
Although the above embodiments describe the image retrieval method based on deep learning and hash coding in terms of semantic categories and visual attributes, those skilled in the art will understand that, in other embodiments and according to different needs, other image characteristics may serve as the retrieval task and the image annotation, using the training methods described above. For example, other annotation information correlated with an image, such as the place where it was taken or made, may be trained with a model analogous to the semantic category model above; or, when screening photographic works of a certain style, that style may be treated as a positive example and trained with a model analogous to the visual attribute model above.
Compared with the prior art, the image retrieval method based on deep learning and hash coding provided by the embodiments of the invention uses a binary hash code learned end to end by a deep neural network as the image representation, greatly reducing storage overhead and retrieval computation. It embeds different kinds of image feature information into the binary hash code using loss functions that do not depend on image triplets, which speeds up network convergence and lets the final binary hash code serve multiple different retrieval tasks. Meanwhile, separate classifiers are trained for the different image attributes, avoiding excessive dependencies among the attribute classifiers and ensuring the extensibility of the trained model.
Although the present invention has been described by way of preferred embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims (7)

1. A model training method based on deep learning and Hash coding comprises the following steps:
step 1), taking partly labeled image data as training data for a network model, and expressing the training data through a deep network as class binary hash codes, where a class binary hash code is a simulated binary hash code whose values are continuous, and partly labeled image data is image data for which at least one of the semantic category label and the visual attribute labels is known;
step 2), connecting the class binary hash codes obtained in the step 1) to a plurality of task layers of a deep network as input, and training by using a plurality of tasks, wherein the plurality of task layers are task layers capable of being used as image retrieval tasks, and the image retrieval tasks comprise: performing image retrieval aiming at the semantic category of the image and performing image retrieval aiming at the visual attribute of the image;
and 3) obtaining a binary hash code which is used for representing the training data and has characteristic information available for retrieval based on the class binary hash code of the step 1).
2. The deep learning and hash coding based model training method of claim 1, for semantic classes of the images, training using a classification task or a metric learning task based on image pairs.
3. The deep learning and hash coding based model training method of claim 1, wherein a set of visual attribute classifiers is trained for the visual attributes.
4. The deep learning and hash coding based model training method according to any one of claims 1 to 3, wherein label prediction is performed on image data with incomplete or no label by using the network model, so as to complement the attribute labels of all images in the image data.
5. A method of deep learning and hash coding based image retrieval employing the method of any of claims 1 to 4, comprising:
when performing a semantic category retrieval task from a query image: obtaining the binary hash code of the query image using the network model, and comparing it with the binary hash codes of all images in an image database to obtain, as the retrieval result, images having the same semantic category as the query image; or
when performing a retrieval task from the visual attribute information of one or more query images: recovering, using the network model, the corresponding visual attribute information of all images in the database from their binary hash codes, and obtaining, as the retrieval result, the images having that visual attribute information; or
when performing a retrieval task from the semantic category of a query image together with one or more specified items of visual attribute information: first recovering, using the network model, the corresponding visual attribute information of all images in the database from their binary hash codes, and screening the database with the recovered visual attribute information; then obtaining the binary hash code of the query image using the network model, and comparing it with the binary hash codes of the images in the screened database to obtain, as the retrieval result, images that have the same semantic category as the query image and possess the specified visual attributes.
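Not part of the patent text: a toy NumPy sketch of the three retrieval modes in claim 5, assuming category retrieval ranks database codes by Hamming distance and attribute retrieval screens on recovered binary attribute vectors. The 8-bit codes, 3 attributes, and database contents are invented for illustration.

```python
import numpy as np

def hamming_distance(query_code, db_codes):
    """Number of differing bits between the query code and each database code."""
    return np.count_nonzero(db_codes != query_code, axis=1)

def retrieve_by_category(query_code, db_codes, top_k=3):
    """Semantic-category retrieval: rank the database by Hamming distance."""
    return np.argsort(hamming_distance(query_code, db_codes))[:top_k]

def filter_by_attributes(db_attributes, required):
    """Attribute retrieval: keep images whose recovered attributes satisfy
    every required (attribute index -> value) constraint."""
    mask = np.ones(len(db_attributes), dtype=bool)
    for idx, val in required.items():
        mask &= db_attributes[:, idx] == val
    return np.nonzero(mask)[0]

# toy database: 8-bit binary hash codes and 3 binary visual attributes per image
db_codes = np.array([[0, 1, 1, 0, 1, 0, 0, 1],
                     [0, 1, 1, 0, 1, 0, 1, 1],
                     [1, 0, 0, 1, 0, 1, 1, 0],
                     [0, 1, 0, 0, 1, 0, 0, 1]], dtype=np.uint8)
db_attrs = np.array([[1, 0, 1],
                     [1, 1, 1],
                     [0, 0, 0],
                     [1, 0, 0]], dtype=np.uint8)
query = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)

# combined retrieval (third mode): screen by attribute, then rank the survivors
candidates = filter_by_attributes(db_attrs, {0: 1})     # images with attribute 0 set
dists = hamming_distance(query, db_codes[candidates])
ranked = candidates[np.argsort(dists)]
```

Screening first shrinks the candidate set, so the Hamming comparison in the combined mode runs over fewer codes than a full-database scan.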
6. An image retrieval system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of claim 5.
7. A computer-readable storage medium storing a computer program, wherein the program, when executed, performs the steps of claim 5.
CN201710525604.1A 2017-06-30 2017-06-30 Image retrieval method based on deep learning and Hash coding Active CN107330074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710525604.1A CN107330074B (en) 2017-06-30 2017-06-30 Image retrieval method based on deep learning and Hash coding

Publications (2)

Publication Number Publication Date
CN107330074A CN107330074A (en) 2017-11-07
CN107330074B true CN107330074B (en) 2020-05-26

Family

ID=60199596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710525604.1A Active CN107330074B (en) 2017-06-30 2017-06-30 Image retrieval method based on deep learning and Hash coding

Country Status (1)

Country Link
CN (1) CN107330074B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992611B (en) * 2017-12-15 2018-12-28 清华大学 The high dimensional data search method and system of hash method are distributed based on Cauchy
CN108280451B (en) * 2018-01-19 2020-12-29 北京市商汤科技开发有限公司 Semantic segmentation and network training method and device, equipment and medium
CN108427738B (en) * 2018-03-01 2022-03-25 中山大学 Rapid image retrieval method based on deep learning
CN108829912A (en) * 2018-04-16 2018-11-16 浙江工业大学 A kind of circuit input vector characterization method based on APHash
CN108549915B (en) * 2018-04-27 2021-06-15 成都考拉悠然科技有限公司 Image hash code training model algorithm based on binary weight and classification learning method
CN109002463A (en) * 2018-06-05 2018-12-14 国网辽宁省电力有限公司信息通信分公司 A kind of Method for text detection based on depth measure model
CN108984642B (en) * 2018-06-22 2021-07-27 西安工程大学 Printed fabric image retrieval method based on Hash coding
CN109241317B (en) * 2018-09-13 2022-01-11 北京工商大学 Pedestrian Hash retrieval method based on measurement loss in deep learning network
CN109543772B (en) * 2018-12-03 2020-08-25 北京锐安科技有限公司 Data set automatic matching method, device, equipment and computer readable storage medium
CN111866443A (en) * 2019-04-25 2020-10-30 黄河 Video stream data storage method, device, system and storage medium
CN110135582B (en) * 2019-05-09 2022-09-27 北京市商汤科技开发有限公司 Neural network training method, neural network training device, image processing method, image processing device and storage medium
CN111581420B (en) * 2020-04-30 2023-07-28 徐州医科大学 Flink-based medical image real-time retrieval method
CN111626408B (en) * 2020-05-22 2021-08-06 深圳前海微众银行股份有限公司 Hash coding method, device and equipment and readable storage medium
CN112949722B (en) * 2021-03-05 2023-05-05 北京大学深圳研究生院 Image combination feature expression-based few-sample learning method and system
CN114372205B (en) * 2022-03-22 2022-06-10 腾讯科技(深圳)有限公司 Training method, device and equipment of characteristic quantization model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469096A (en) * 2015-11-18 2016-04-06 南京大学 Feature bag image retrieval method based on Hash binary code
CN105512273A (en) * 2015-12-03 2016-04-20 中山大学 Image retrieval method based on variable-length depth hash learning
CN106126581A (en) * 2016-06-20 2016-11-16 复旦大学 Cartographical sketching image search method based on degree of depth study
CN106407352A (en) * 2016-09-06 2017-02-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Traffic image retrieval method based on depth learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Deep Supervised Hashing for Fast Image Retrieval";Haomiao Liu等;《The IEEE Conference on Computer Vision and Pattern Recognition》;20160630;第2065页图1,第2067页第3.3节,第2068页第4节 *

Also Published As

Publication number Publication date
CN107330074A (en) 2017-11-07

Similar Documents

Publication Publication Date Title
CN107330074B (en) Image retrieval method based on deep learning and Hash coding
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN109241317B (en) Pedestrian Hash retrieval method based on measurement loss in deep learning network
Triggs et al. Scene segmentation with CRFs learned from partially labeled images
CN109671102B (en) Comprehensive target tracking method based on depth feature fusion convolutional neural network
CN111198959A (en) Two-stage image retrieval method based on convolutional neural network
CN109948735B (en) Multi-label classification method, system, device and storage medium
CN109271486B (en) Similarity-preserving cross-modal Hash retrieval method
CN108388656B (en) Image searching method based on mark correlation
CN114398491A (en) Semantic segmentation image entity relation reasoning method based on knowledge graph
CN109829065B (en) Image retrieval method, device, equipment and computer readable storage medium
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN109871379B (en) Online Hash nearest neighbor query method based on data block learning
CN112115806B (en) Remote sensing image scene accurate classification method based on Dual-ResNet small sample learning
CN113269224A (en) Scene image classification method, system and storage medium
CN115457332A (en) Image multi-label classification method based on graph convolution neural network and class activation mapping
CN115439715A (en) Semi-supervised few-sample image classification learning method and system based on anti-label learning
CN117893839B (en) Multi-label classification method and system based on graph attention mechanism
CN110442736B (en) Semantic enhancer spatial cross-media retrieval method based on secondary discriminant analysis
CN115187910A (en) Video classification model training method and device, electronic equipment and storage medium
CN113076490B (en) Case-related microblog object-level emotion classification method based on mixed node graph
CN110110120B (en) Image retrieval method and device based on deep learning
CN108647295B (en) Image labeling method based on depth collaborative hash

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant