CN114547365A - Image retrieval method and device - Google Patents

Image retrieval method and device

Info

Publication number
CN114547365A
CN114547365A
Authority
CN
China
Prior art keywords
retrieved
image
hash
floating point
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210163348.7A
Other languages
Chinese (zh)
Inventor
姜光杰
丰振航
刘祥
郝旭宁
夏婷
刘微
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense TransTech Co Ltd
Original Assignee
Hisense TransTech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense TransTech Co Ltd filed Critical Hisense TransTech Co Ltd
Priority claimed from application CN202210163348.7A
Publication of CN114547365A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image retrieval method and device, wherein the method comprises the following steps: acquiring an initial image set to be retrieved; for each image to be retrieved in the initial image set, determining the floating point feature and the hash feature of the image based on a target image retrieval model; comparing the hash feature of each image to be retrieved with the reference hash feature of a reference image, and screening the initial image set to obtain a first image set to be retrieved; and processing the floating point feature of each first image to be retrieved in the first image set together with the reference floating point feature of the reference image, and screening the first image set to obtain a target image set, thereby achieving fast and accurate image retrieval.

Description

Image retrieval method and device
Technical Field
The invention relates to the technical field of big data processing, in particular to an image retrieval method and device.
Background
With the continuing and deepening application of big data and artificial intelligence analysis technologies in the public safety field, front-end intelligent snapshot cameras are increasingly widely deployed, and obtaining relevant images from massive snapshot images according to user requirements has become particularly important.
In the related art, the snapshot images in a database are generally ranked by measuring, one by one, the distance in feature space between each snapshot image and a standard image, and the image retrieval result is then returned according to these distances. However, a common database typically holds millions of images, so this approach is inefficient.
In addition, approaches have appeared in the related art that extract shallow feature information from images and retrieve images by comparing this shallow information. However, the features learned by such a network carry limited information, so this approach is only suitable when the number of image categories is small, and its retrieval accuracy is low in complicated monitoring scenes.
Therefore, in the face of massive snapshot data, how to realize fast and accurate image retrieval is very important.
Disclosure of Invention
An exemplary embodiment of the present invention provides an image retrieval method and apparatus to implement fast and accurate image retrieval.
According to a first aspect of the exemplary embodiments, there is provided an image retrieval method, including:
acquiring an initial image set to be retrieved;
for each image to be retrieved in the initial image set to be retrieved, determining the floating point feature and the hash feature of the image to be retrieved based on a target image retrieval model;
comparing the hash features of the images to be retrieved with the reference hash features of the reference images, and screening the initial image set to be retrieved to obtain a first image set to be retrieved;
and processing the floating point feature of each first image to be retrieved in the first image set to be retrieved together with the reference floating point feature of the reference image, and screening the first image set to be retrieved to obtain a target image set.
In the embodiment of the present application, the initial image set to be retrieved may contain a large number of images to be retrieved. Therefore, after the initial image set is acquired, the floating point feature and the hash feature of each image to be retrieved are determined based on the target image retrieval model. The hash feature of each image to be retrieved is then compared with the reference hash feature of the reference image to perform a first screening, which yields the first image set to be retrieved. Next, the floating point feature of each first image to be retrieved in the first image set is processed together with the reference floating point feature of the reference image. The floating point feature is usually a high-dimensional floating point feature, and this processing measures cosine similarity, so the screening is more accurate and constitutes a fine screening, which outputs the image data closest to the reference image. The image retrieval method of the embodiment of the present application can therefore not only guarantee retrieval precision but also improve retrieval speed.
According to a second aspect of the exemplary embodiments, there is provided an electronic device comprising a processor and an image acquisition unit, wherein:
the image acquisition unit is configured to perform:
acquiring an initial image set to be retrieved;
the processor is configured to perform:
for each image to be retrieved in the initial image set to be retrieved, determining the floating point feature and the hash feature of the image to be retrieved based on a target image retrieval model;
comparing the hash features of the images to be retrieved with the reference hash features of the reference images, and screening the initial image set to be retrieved to obtain a first image set to be retrieved;
and processing the floating point feature of each first image to be retrieved in the first image set to be retrieved together with the reference floating point feature of the reference image, and screening the first image set to be retrieved to obtain a target image set.
According to a third aspect of the exemplary embodiments, there is provided an image retrieval apparatus including:
the image acquisition module is used for acquiring an initial image set to be retrieved;
the feature determination module is used for determining, for each image to be retrieved in the initial image set to be retrieved, the floating point feature and the hash feature of the image based on a target image retrieval model;
the first screening module is used for comparing the hash features of the images to be retrieved with the reference hash features of the reference images and screening the initial image set to be retrieved to obtain a first image set to be retrieved;
and the second screening module is used for processing the floating point feature of each first image to be retrieved in the first image set to be retrieved together with the reference floating point feature of the reference image, and screening the first image set to be retrieved to obtain a target image set.
According to a fourth aspect of the exemplary embodiments, there is provided a computer storage medium having stored therein computer program instructions which, when run on a computer, cause the computer to perform the image retrieval method according to the first aspect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic diagram illustrating an application scenario of image retrieval according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating an image retrieval method provided by an embodiment of the invention;
FIG. 3 is a flow chart illustrating another image retrieval method provided by the embodiment of the invention;
FIG. 4 is a flow chart illustrating a training process of a target image retrieval model according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a training process of another target image retrieval model provided by an embodiment of the invention;
FIG. 6 is a schematic structural diagram illustrating an exemplary target image retrieval model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram schematically illustrating an image retrieval apparatus according to an embodiment of the present invention;
fig. 8 schematically illustrates a structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below:
(1) Hamming distance: used in error control coding for data transmission, it is the number of positions at which the corresponding bits of two equal-length words differ. Performing an exclusive-OR operation on the two bit strings and counting the number of 1s in the result gives the Hamming distance.
(2) Cosine distance: the cosine of the angle between two vectors (in this document, a larger value indicates higher similarity).
(3) ResNet: short for Residual Network. This family of networks is widely used in fields such as object classification; typical variants include ResNet50 and ResNet101.
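The two distance measures defined above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the patent's disclosure; the function names are chosen for clarity.

```python
import math

def hamming_distance(a: int, b: int) -> int:
    # XOR the two equal-length bit strings, then count the 1 bits.
    return bin(a ^ b).count("1")

def cosine_similarity(u, v) -> float:
    # Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|).
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

print(hamming_distance(0b1011, 0b1101))            # 2 bits differ
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))   # ≈ 0.7071
```

The Hamming computation is pure bit arithmetic, which is why the document later treats hash comparison as cheap relative to floating point comparison.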
In practical applications, front-end intelligent snapshot cameras are increasingly widely used, and obtaining relevant images from massive snapshot images according to user requirements is particularly important.
The image retrieval method in the related art has low retrieval efficiency or low retrieval accuracy.
Therefore, an embodiment of the present application provides an image retrieval method. In the method, for each image to be retrieved in an initial image set to be retrieved, the floating point feature and the hash feature of the image are determined based on a target image retrieval model. The hash feature of each image to be retrieved is compared with the reference hash feature of the reference image, and the initial image set is screened to obtain a first image set to be retrieved; this screening is a coarse screening using the hash feature, which narrows the range of the subsequent fine screening using the floating point feature and thus improves retrieval speed. Then, the floating point feature of each first image to be retrieved in the first image set is processed together with the reference floating point feature of the reference image, and the first image set is screened to obtain a target image set; this screening is a fine screening using the floating point feature, which guarantees retrieval precision. The image retrieval method provided by the present application therefore realizes fast and accurate image retrieval.
After introducing the design concept of the embodiments of the present application, some brief descriptions of the application scenarios to which the technical solution can be applied are provided below. It should be noted that the scenarios described are only used for explaining the embodiments of the present application and are not limiting. In specific implementation, the technical solution provided by the embodiments of the present application can be applied flexibly according to actual needs.
Fig. 1 is a schematic view of an application scenario of an image retrieval method according to an embodiment of the present application. The method can be applied to services related to public safety, intelligent communities, or smart cities. Taking a traffic scene as an example, the bayonet (checkpoint) cameras installed at each intersection serve as image acquisition devices. A large number of images are thus collected; for example, to analyze the traffic flow at each intersection in different time periods, specific images need to be screened out to determine the traffic flow conditions.
The application scenario includes a plurality of image capturing devices 101 (image capturing device 101-1, image capturing device 101-2, …, image capturing device 101-n) and an image retrieval device 102. The image capturing devices 101 and the image retrieval device 102 are connected via a wireless or wired network; an image capturing device 101 may be, for example, a camera or a bayonet camera. The image retrieval device 102 may be a single image retrieval device, a cluster composed of several image retrieval devices, or a cloud computing center; it may be an independent physical device, a cluster or distributed system formed by a plurality of physical devices, or a cloud device providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
Each image acquisition device 101 sends the massive images it acquires to the image retrieval device 102. The image retrieval device 102 stores reference images in advance (which define what is being searched for), so it can analyze the floating point feature and the hash feature of each acquired image, coarse-screen the massive image set using the hash features to reduce the number of images that must be fine-screened and thus improve retrieval speed, and then fine-screen using the floating point features to guarantee retrieval accuracy.
To further illustrate the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide method steps as shown in the following embodiments or figures, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the embodiments of the present application.
The following describes, with reference to the application scenario shown in fig. 1, a flowchart of an image retrieval method shown in fig. 2, a technical solution provided in an embodiment of the present application.
S201, acquiring an initial image set to be retrieved.
S202, for each image to be retrieved in the initial image set to be retrieved, determining the floating point feature and the hash feature of the image to be retrieved based on a target image retrieval model.
S203, comparing the hash features of the images to be retrieved with the reference hash features of the reference images, and screening the initial image set to be retrieved to obtain a first image set to be retrieved.
S204, processing the floating point feature of each first image to be retrieved in the first image set to be retrieved together with the reference floating point feature of the reference image, and screening the first image set to be retrieved to obtain a target image set.
In the embodiment of the present application, the initial image set to be retrieved may contain a large number of images to be retrieved. Therefore, after the initial image set is acquired, the floating point feature and the hash feature of each image to be retrieved are determined based on the target image retrieval model. The hash feature of each image to be retrieved is then compared with the reference hash feature of the reference image to perform a first screening, which yields the first image set to be retrieved. Next, the floating point feature of each first image to be retrieved in the first image set is processed together with the reference floating point feature of the reference image. The floating point feature is usually a high-dimensional floating point feature, and this processing measures cosine similarity, so the screening is more accurate and constitutes a fine screening, which outputs the image data closest to the reference image. The image retrieval method of the embodiment of the present application can therefore not only guarantee retrieval precision but also improve retrieval speed.
Referring to S201, in an actual application scenario, the study object is, for example, region B of city A, and all the bayonet cameras within this range serve as image acquisition devices. In a specific example, when a user needs to learn the traffic flow situation in this area over the past week, the images acquired by all the checkpoint cameras during that week may be collected; this collection is referred to as the initial image set to be retrieved. The desired target image set is then screened out of the initial image set to be retrieved.
Referring to S202, an image generally contains a large amount of information, and information can be extracted from an image to characterize each of its dimensions. In practice, different extraction methods yield different amounts and kinds of information from the image.
In order to find an image desired by a user in the initial image set to be retrieved, a reference image needs to be preset according to the requirements of the user, and then an image close to the reference image in the initial image set to be retrieved is found.
For each image to be retrieved in the initial image set to be retrieved, the floating point feature and the hash feature of the image are determined based on a pre-trained target image retrieval model. The hash feature may be called a short feature and is used to coarsely screen the initial image set to be retrieved; the floating point feature may be called a long feature and is used to finely screen the coarsely screened image set.
Specifically, each image to be retrieved is used as an input to the target image retrieval model; the output before the fully connected layer of the model is taken as the floating point feature of the image, and the output of the hash layer of the model is taken as the hash feature. The target image retrieval model can be a pre-trained neural network model; its training process is explained later.
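For illustration only, the relationship between the two feature types can be mimicked with a toy forward pass. The random weights, the 2048-dimensional float feature, the 128-bit hash, and the sign binarization below are assumptions made for this sketch; the description here only specifies that the floating point feature is taken before the fully connected layer and the hash feature from the hash layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the trained network: a "backbone" producing the
# floating point feature (the output before the fully connected layer) and a
# "hash layer" producing the hash feature.
W_backbone = rng.standard_normal((2048, 3 * 32 * 32))  # assumed backbone weights
W_hash = rng.standard_normal((128, 2048))              # assumed hash-layer weights

def extract_features(image: np.ndarray):
    x = image.reshape(-1)                 # flatten a toy 3x32x32 image
    float_feat = W_backbone @ x           # high-dimensional floating point ("long") feature
    bits = (W_hash @ float_feat) > 0      # assumed sign binarization in the hash layer
    hash_feat = np.packbits(bits)         # compact binary ("short") hash feature
    return float_feat, hash_feat

img = rng.random((3, 32, 32))
float_feat, hash_feat = extract_features(img)
print(float_feat.shape, hash_feat.shape)  # (2048,) (16,)
```

The point of the sketch is the asymmetry: the hash feature packs into 16 bytes and can be compared with bit operations, while the float feature stays high-dimensional and is compared with cosine similarity.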
In S203, after the floating point feature and the hash feature of each image to be retrieved in the initial image set have been obtained, the hash feature of each image is compared with the reference hash feature of the reference image, and the initial image set to be retrieved is screened to obtain the first image set to be retrieved.
Since the comparison process of the hash features is relatively computationally inexpensive, the screening process may be referred to as coarse screening. Illustratively, the specific implementation of the coarse screening process is as follows:
for each image to be retrieved, determining the Hamming distance between the hash feature of the image and the reference hash feature; screening the initial image set to be retrieved according to the relationship between the Hamming distance corresponding to each image and a preset Hamming distance threshold; and determining the set formed by the images whose Hamming distances are smaller than the preset Hamming distance threshold as the first image set to be retrieved.
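The coarse-screening steps above can be sketched as follows. This is a hypothetical Python sketch with toy 8-bit hashes; the function names and the threshold value are illustrative, not taken from the patent.

```python
def hamming(a: bytes, b: bytes) -> int:
    # XOR corresponding bytes, then count the set bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def coarse_screen(candidates, ref_hash: bytes, threshold: int):
    """Keep images whose Hamming distance to the reference hash is below threshold."""
    return [img for img, h in candidates if hamming(h, ref_hash) < threshold]

# Toy 8-bit hashes; images are represented by labels here.
ref = bytes([0b10110010])
candidates = [
    ("img1", bytes([0b10110011])),  # distance 1 -> kept
    ("img2", bytes([0b01001101])),  # distance 8 -> dropped
    ("img3", bytes([0b10100010])),  # distance 1 -> kept
]
print(coarse_screen(candidates, ref, threshold=3))  # ['img1', 'img3']
```

In a real deployment the hashes would be much longer, but the filter stays a linear scan of cheap bit operations, which is what makes this pass fast over millions of images.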
Specifically, the hash feature and the floating point feature of the reference image can be determined by feeding the reference image into the target image retrieval model. For clarity, the hash feature of the reference image is referred to as the reference hash feature, and its floating point feature as the reference floating point feature.
During the coarse screening, the Hamming distance between the hash feature of each image to be retrieved and the reference hash feature is determined; how to calculate the Hamming distance from the hash features may follow methods in the related art and is not described here. A Hamming distance threshold is preset; when setting it, both retrieval precision and retrieval speed are considered, and a compromise value is selected. The larger the Hamming distance, the lower the similarity between the image to be retrieved and the reference image. Images whose Hamming distance is greater than or equal to the preset threshold are therefore deleted, and the remaining images in the initial image set, whose Hamming distances are smaller than the threshold, form the first image set to be retrieved.
The comparison of hash features is a fast comparison performed through bit operations; its computation amount is relatively small and its speed high, so the coarse screening consumes little computing power. It removes the images that differ greatly from the reference image and narrows the candidate range, that is, it reduces the cardinality of the image set entering the next screening step and thereby speeds up the subsequent retrieval.
Referring to S204, after the rough screening is performed, the number of the obtained first images to be retrieved is reduced compared with the initial image set to be retrieved. And then, processing the floating point characteristics of each first image to be retrieved in the first image set to be retrieved and the reference floating point characteristics of the reference image, and screening the first image set to be retrieved to obtain a target image set.
Since the process of matching floating point features can achieve precise matching between images, the screening process can be called fine screening. Illustratively, the specific implementation of the fine screening process is as follows:
for each first image to be retrieved in the first image set to be retrieved, determining the cosine distance between the floating point feature of the image and the reference floating point feature; screening the first image set according to the relationship between the cosine distance corresponding to each first image and a preset cosine distance threshold; and determining the set formed by the first images whose cosine distances are greater than the preset cosine distance threshold as the target image set.
Specifically, for each first image to be retrieved, the cosine distance between its floating point feature and the reference floating point feature is determined; how to calculate the cosine distance from the floating point features may follow methods in the related art and is not described here. A cosine distance threshold is preset; when setting it, both retrieval precision and retrieval speed are considered, and a compromise value is selected. The larger the cosine distance, the higher the similarity between the image to be retrieved and the reference image. First images whose cosine distance is smaller than or equal to the preset threshold are therefore deleted, and the set formed by the first images whose cosine distances are greater than the threshold is taken as the target image set.
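The fine-screening step can be sketched in the same style. Again a hypothetical sketch with toy 3-dimensional features and an illustrative threshold; real floating point features would be high-dimensional.

```python
import math

def cosine(u, v) -> float:
    # Cosine of the angle between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def fine_screen(candidates, ref_feat, threshold: float):
    """Keep images whose cosine similarity to the reference feature exceeds threshold."""
    return [img for img, f in candidates if cosine(f, ref_feat) > threshold]

ref = [0.6, 0.8, 0.0]
candidates = [
    ("imgA", [0.6, 0.79, 0.05]),  # nearly parallel to ref -> kept
    ("imgB", [-0.6, -0.8, 0.0]),  # opposite direction     -> dropped
    ("imgC", [0.0, 0.1, 0.99]),   # near-orthogonal        -> dropped
]
print(fine_screen(candidates, ref, threshold=0.9))  # ['imgA']
```

Note the inequality direction is the opposite of the coarse screen: here a larger value means higher similarity, so candidates above the threshold are kept.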
Since the floating point feature is usually a high-dimensional floating point feature, the cosine distance calculation in the fine screening is a measurement based on cosine similarity, and this process outputs the image data closest to the reference image.
Therefore, the image retrieval method of the embodiment of the application can not only ensure the retrieval precision, but also improve the retrieval speed.
To further illustrate the technical solution of the present application, the image retrieval method of the embodiment of the present application is described below with a complete flowchart. Referring to fig. 3, the method includes at least the following steps:
s301, acquiring an initial image set to be retrieved.
S302, inputting each image to be retrieved in the initial image set to be retrieved into the target image retrieval model, determining the output of the model before the fully connected layer as the floating point feature of the image, and determining the output of the hash layer in the model as the hash feature.
S303, aiming at each image to be retrieved, determining the Hamming distance between the Hash characteristics of the image to be retrieved and the reference Hash characteristics.
S304, screening the initial image set to be retrieved by applying the size relation between the Hamming distance corresponding to each image to be retrieved and a preset Hamming distance threshold value.
S305, determining a set formed by images to be retrieved corresponding to the Hamming distances smaller than a preset Hamming distance threshold value as a first image set to be retrieved.
S306, aiming at each first image to be retrieved in the first image to be retrieved set, determining the cosine distance between the floating point feature of the first image to be retrieved and the reference floating point feature.
S307, screening the first image set to be retrieved by applying the size relation between the cosine distance corresponding to each first image to be retrieved and a preset cosine distance threshold.
S308, determining each first image to be retrieved corresponding to each cosine distance greater than a preset cosine distance threshold value to form a target image set.
In the embodiment of the present application, the trained target image retrieval model first determines the hash feature and the floating point feature of each image to be retrieved. The Hamming distance between the hash feature of each image to be retrieved and the reference hash feature of the reference image is then determined, the similarity between each image and the reference image is judged through this distance measurement, and the images with high similarity are returned, completing the first step of coarse screening. Next, the cosine distance between the floating point feature of each coarsely screened image and the reference floating point feature of the reference image is determined, the similarity is again judged through distance measurement, and the images with high similarity are returned, completing the retrieval.
As described above for the image retrieval process in the embodiment of the present application, the floating point feature and the hash feature of each image to be retrieved are applied during retrieval; both features are output by the target image retrieval model when the image to be retrieved is fed to it as input. Referring to fig. 4, the training process of the target image retrieval model is described below:
S401, acquiring a training sample set; wherein the training sample set comprises a plurality of training images.
Specifically, the training sample set may be a plurality of training images acquired by the image acquisition device within a certain time range. In addition, the training sample set is preprocessed and the labels of the training set are set as one-hot labels; for example, a label is represented as T = [t1, t2, …, tn], ti ∈ {0, 1}, where n is the number of classes.
S402, constructing an initial deep learning network based on the category of each training image in the training sample set; wherein the initial deep learning network model adds a fully connected layer after the Conv5_x layer of the basic ResNet50 network.
Specifically, an initial deep learning network for learning floating point features is constructed; for example, a ResNet50 network can be used as the base of floating point feature learning, and a fully connected layer is added after its last Conv5_x layer for classification.
S403, setting a first loss function, training the initial deep learning network model based on the training sample set, and determining a first parameter set and a trained initial deep learning network model based on the first parameter set.
The initial parameter configuration of the initial deep learning network is shown in table 1.
TABLE 1 configuration and parameter Table for initial deep learning network for extracting floating point features
For example, parameters of a ResNet50 model pre-trained on the ImageNet database may be used to initialize the convolutional layers of the initial deep learning network, and the parameters of its fully connected layer are randomly initialized from a Gaussian distribution with mean 0 and variance 0.001.
Specifically, the first loss function is constructed as follows:
loss = α·softmax loss + β·center loss + γ·triple loss;
wherein loss is the first loss function, softmax loss is the softmax loss function, center loss is the center loss function, triple loss is the triplet loss function, and α, β and γ are the adjustment parameters of the corresponding loss functions.
In one specific example, the definition of each loss function is as follows:
(1) the softmax loss function is defined as follows:
softmax loss = -log( e^{z_y} / Σ_j e^{z_j} )
where z is the set of results output by the fully connected layer, z_y is the output corresponding to the true class, and j runs over the outputs of the fully connected layer.
The softmax loss function maximizes the probability that the classification is correct, thereby improving the accuracy of the classification.
(2) The center loss function is defined as follows:
center loss = (1/2) Σ_{i=1}^{m} ‖ x_i - c_{y_i} ‖_2^2
wherein c_{y_i} denotes the feature center of the y_i-th class, x_i represents the features before the fully connected layer, and m represents the size of the mini-batch.
The center loss function takes the distance between each feature and its class center as the loss, so that the intra-class distance within the same class is further reduced, the inter-class distance is increased, and the feature space distribution becomes more uniform.
(3) The triplet loss function triple loss is defined as follows:
triple loss = max( d(a, p) - d(a, n) + margin, 0 )
wherein a is an anchor sample, p is a positive sample of the same class as a, n is a negative sample of a different class, and d(·,·) is a distance measure. The triplet loss function makes targets of the same class closer and targets of different classes farther apart; that is, the final optimization minimizes the distance d(a, p) while maximizing the distance d(a, n).
Thus, the first loss function is:
loss = α·softmax loss + β·center loss + γ·triple loss.
The parameters α, β and γ are adjusted according to a cross-validation principle. The set of all determined parameters, including α, β and γ, is referred to as the first parameter set; thus, the trained initial deep learning network model can be determined from the parameters in the first parameter set. Based on the parameter settings in table 1, the output of the trained initial deep learning network model is the 2048-dimensional floating point feature corresponding to the input image.
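To make the composition of the first loss function concrete, the sketch below evaluates the three terms on toy features in NumPy; the shapes, margin and the weights α, β, γ are illustrative assumptions:

```python
import numpy as np

def softmax_loss(z, y):
    """-log softmax probability of the true class y for logits z."""
    z = z - z.max()                       # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y])

def center_loss(x, centers, y):
    """0.5 * squared distance between feature x and its class center."""
    return 0.5 * np.sum((x - centers[y]) ** 2)

def triplet_loss(a, p, n, margin=0.2):
    """max(d(a,p) - d(a,n) + margin, 0) with Euclidean distance."""
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(d_ap - d_an + margin, 0.0)

def first_loss(z, y, x, centers, a, p, n, alpha=1.0, beta=0.01, gamma=0.5):
    """loss = alpha*softmax loss + beta*center loss + gamma*triplet loss."""
    return (alpha * softmax_loss(z, y)
            + beta * center_loss(x, centers, y)
            + gamma * triplet_loss(a, p, n))
```

In practice each term would be computed per mini-batch on the network's outputs; the weights would then be tuned by cross-validation as described above.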
S404, adding a hash layer after the Conv5_x layer of the trained initial deep learning network model to obtain a second deep learning network model; wherein the hash layer is a fully connected layer having a preset number of neurons.
Specifically, since the second deep learning network model is obtained by modifying the trained initial deep learning network model, that is, by adding a hash layer after the Conv5_x layer, the base network of the second deep learning network model is also the ResNet50 network. Illustratively, the hash layer is also a fully connected layer whose preset number of neurons equals the hash code length; it maps the output of the penultimate layer into Hamming space.
S405, setting a second loss function, keeping all parameters in the first parameter set unchanged, training a second deep learning network model based on the training sample set, and determining a second parameter set and a target image retrieval model based on the second parameter set; the second parameter set comprises parameters in the first parameter set and parameters of the hash layer.
In the training process of the second deep learning network model, because both the second deep learning network and the initial deep learning network share the ResNet50 infrastructure, the parameters of the other ResNet50 layers (the parameters in the first parameter set) are kept unchanged, which reduces the amount of computation during model training and improves the training speed.
In the training process of the second deep learning network model, a relaxation strategy is adopted: the tanh activation function is selected so that the network outputs a 128-dimensional hash feature vector taking values in {-1, 1}. The metric loss function, the loss function determined based on the regularization loss function and the classification loss function, and the quantization loss function are selected for training, with an RMSprop optimizer and a CosineAnnealingLR learning rate adjustment strategy. The main purpose is to keep a good inter-class distance after mapping similar high-dimensional features to the low-dimensional Hamming space, while keeping the distances of similar classes close and the distances of different classes far apart. This ensures, as far as possible, that a compact binary code is formed from the high-dimensional vector space to the low-dimensional Hamming space and that retrieval similarity is preserved.
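The relaxation strategy can be illustrated as follows: during training, tanh produces a continuous ("relaxed") code in (-1, 1), and at inference sign() quantizes it to a discrete {-1, 1} code. A minimal NumPy sketch, with the 128-bit length taken from the text:

```python
import numpy as np

def relaxed_hash(logits):
    """Continuous hash used during training: tanh maps into (-1, 1)."""
    return np.tanh(logits)

def binary_hash(logits):
    """Discrete hash used at inference: sign maps into {-1, 1}.
    np.sign(0) is 0, so zeros are pushed to +1 to keep a valid code."""
    b = np.sign(logits)
    b[b == 0] = 1
    return b

logits = np.random.randn(128)    # output of the 128-neuron hash layer
u = relaxed_hash(logits)         # relaxed code, fed to the loss terms
b = binary_hash(logits)          # final binary code, used for retrieval
```

The relaxed code keeps the loss differentiable; only the final binary code is stored and compared by Hamming distance.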
Specifically, the initial parameter configuration of the second deep learning network model is shown in table 2.
Table 2 configuration and parameter table of second deep learning network model for extracting hash feature
For example, parameters of a ResNet50 model pre-trained on the Market-1501 database can be used to initialize the convolutional layers of the second deep learning network model, and the parameters of the fully connected layer are randomly initialized from a Gaussian distribution with mean 0 and variance 0.01.
Wherein the second loss function is as follows:
F=J+μQ+λC;
wherein F is the second loss function, μ is a trade-off parameter, and λ is a regularization parameter;
(1) J is a metric loss function, which may be chosen, for example, as a maximum likelihood function or the cross-entropy function. It is used for learning the hash features and is determined from the similarity measure of the hash features corresponding to every two training images in the training sample set.
Specifically, the process of determining J is illustrated:
Since a sample may belong to multiple classes, a similarity measure for the hash codes is calculated as follows:
p(s_ij | b_i, b_j) = σ(Ω_ij)^{s_ij} · (1 - σ(Ω_ij))^{(1 - s_ij)}, with Ω_ij = (1/2)⟨b_i, b_j⟩ and σ(x) = 1/(1 + e^{-x});
wherein S = {s_ij} = T·T^T, s_ij ∈ {0, 1}, is the similarity matrix of the one-hot labels; if the semantics of two hash codes are similar, s_ij = 1, otherwise s_ij = 0;
b_i, b_j are the hash codes of the class-i images and the class-j images, and the Hamming distance dist_H(·,·) and the inner product ⟨b_i, b_j⟩ satisfy:
dist_H(b_i, b_j) = (1/2)(K - ⟨b_i, b_j⟩)
wherein K is a preset constant (the hash code length). If the inner product of two hash codes is small, the Hamming distance between them is large, and vice versa: the larger the inner product, the larger p(1 | B), which means b_i, b_j should be classified as similar.
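For hash codes in {-1, 1}^K, the stated relation between Hamming distance and inner product, dist_H(b_i, b_j) = (K - ⟨b_i, b_j⟩)/2, can be checked directly:

```python
import numpy as np

def hamming(b_i, b_j):
    """Hamming distance counted position by position."""
    return int(np.sum(b_i != b_j))

def hamming_from_inner(b_i, b_j):
    """dist_H(b_i, b_j) = (K - <b_i, b_j>) / 2 for codes in {-1, 1}^K."""
    K = len(b_i)
    return (K - int(np.dot(b_i, b_j))) // 2

b_i = np.array([1, -1, 1, 1, -1, 1])
b_j = np.array([1, 1, -1, 1, -1, -1])
assert hamming(b_i, b_j) == hamming_from_inner(b_i, b_j)
```

This is why maximizing the inner product of similar codes is equivalent to minimizing their Hamming distance.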
In view of the similarity measure, the hash function for learning hash codes is obtained by minimizing the following negative log-likelihood:
J = -Σ_{s_ij ∈ S} ( s_ij·Ω_ij - log(1 + e^{Ω_ij}) )
This formula is a negative log-likelihood function that minimizes the Hamming distance between two similar points while maximizing the Hamming distance between two dissimilar points.
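Reading J as a DPSH-style pairwise negative log-likelihood (an assumption about the patent's exact form), it can be computed over a batch as:

```python
import numpy as np

def metric_loss_J(B, S):
    """Pairwise negative log-likelihood over a batch.
    B: (N, K) matrix of relaxed hash codes in (-1, 1).
    S: (N, N) 0/1 similarity matrix derived from the one-hot labels."""
    omega = 0.5 * B @ B.T                      # Omega_ij = 0.5 * <b_i, b_j>
    # log(1 + e^x) computed stably as logaddexp(0, x)
    return float(np.sum(np.logaddexp(0.0, omega) - S * omega))
```

Pairs marked similar (s_ij = 1) are rewarded for a large inner product, pushing their Hamming distance down; dissimilar pairs are penalized for it.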
(2) Q is a loss function determined based on the regularization loss function and the classification loss function. The classification loss function establishes the relationship between the learned hash features and the real label vectors, and is determined from the classification weights, the real label vectors of the training sample set, and the hash matrix output for a batch of images in the training sample set.
Specifically, the process of determining Q is illustrated:
Although the hash function represented by the formula for J is learned using paired label information, the label information is not fully utilized. To utilize the label information directly, a simple linear classifier is used to model the relationship between the learned hash codes and the label information:
Y = W^T·B
wherein W = [w_1, w_2, …, w_C] is the classification weight matrix, Y = [y_1, y_2, …, y_N] is the true label matrix, and the loss function is as follows:
Q = L(Y, W^T·B) + λ‖W‖_F^2
wherein L(·) is a loss function, for which L2 may be chosen, λ is a regularization parameter, and ‖·‖_F is the Frobenius norm of a matrix. In this formula, the classification loss term L(Y, W^T·B) characterizes the deviation between the true label vectors and the hash output of a batch of images in the training sample set, and λ‖W‖_F^2 is the L2 regularization loss.
(3) The quantization loss function C is a loss function for determining the deviation between the actual output and the theoretical output; it is the quantization loss determined by the distance between the relaxed hash feature produced in the middle of the network (the actual output) and the final binarized hash value.
In order to calculate the deviation between the ideal and actual outputs, and to make the outputs as uniform as possible, the following formula is used:
C = (U - B).pow(2).mean()
wherein U is an initial matrix conforming to a uniform distribution and B is the hash matrix output for the batch of images; .pow(2) squares each element of the matrix, and .mean() finally takes the mean of the matrix elements, so that U - B yields the L2 measure between the hash output and the relaxed tanh output.
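The expression above uses PyTorch tensor methods; an equivalent NumPy sketch, with matrix shapes chosen for illustration:

```python
import numpy as np

def quantization_loss_C(U, B):
    """Mean squared deviation between the two matrices,
    mirroring (U - B).pow(2).mean() in PyTorch."""
    return float(((U - B) ** 2).mean())

rng = np.random.default_rng(0)
U = rng.uniform(-1.0, 1.0, size=(4, 128))   # uniformly distributed matrix
B = np.sign(rng.standard_normal((4, 128)))  # binarized hash matrix in {-1, 1}
c = quantization_loss_C(U, B)
```

The loss is zero exactly when the relaxed and binarized codes coincide, which is the quantization the training tries to encourage.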
In conjunction with the above equation, the second loss function is as follows:
F=J+μQ+λC
where μ is a trade-off parameter. The learned hash codes should make the pairwise label likelihood as large as possible and should be optimal for the jointly learned linear classifier.
The parameters of the hash function are solved by cross-validation: the learning rate and the values of the parameters λ and μ are adjusted according to a cross-validation principle to obtain the final hash function. After the network training is completed, the output of the last fully connected layer with C neurons is passed through a sign() function, so that the real values are quantized into discrete hash codes.
In addition, in order to make the training process of the target image retrieval model easier to understand, the flow of a specific example is described below with reference to fig. 5, including at least the following steps:
S501, obtaining a training sample set.
S502, migrating the ResNet50 model, and constructing an initial deep learning network model according to the number of classes in the training sample set.
S503, setting the first loss function as loss = α·softmax loss + β·center loss + γ·triple loss.
S504, setting the optimizer to SGD, and setting the learning rate schedule to Warm-Up.
Wherein SGD stands for Stochastic Gradient Descent.
S505, completing the training of the initial deep learning network model.
S506, loading the initial deep learning network model, and adding a hash layer after the Conv5_x layer of ResNet50.
S507, setting the second loss function as the sum of the cross-entropy loss, the regularization loss function, the classification loss function and the quantization loss function.
S508, setting the optimizer to RMSprop, and adopting a CosineAnnealingLR decay method for the learning rate.
S509, completing the training of the target image retrieval model.
In addition, in order to make the structure of the target image retrieval model clearer, fig. 6 shows a schematic structural diagram of the target image retrieval model, where the Hash layer is the hash layer, FC is a fully connected layer, Hash feature is the hash feature output by the hash layer, and Float feature is the floating point feature output by the fully connected layer in the ResNet50 network.
In practical application, the target image retrieval model can be packaged as a microservice that provides a Restful API interface. The basic function of the interface is to load the target image retrieval model; after an interface request is received, the JSON payload is parsed to obtain the request image, the target image retrieval model is called to perform inference, and the resulting floating point feature and hash feature are returned as the response.
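A minimal sketch of such a microservice wrapper is shown below; Flask, the endpoint path and the JSON field names are all assumptions, and `extract_features` is a hypothetical stand-in for calling the target image retrieval model:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def extract_features(image_b64: str):
    """Hypothetical placeholder for model inference: decode the image,
    run the target image retrieval model, and return the pair
    (floating point feature, hash feature)."""
    return [0.0] * 2048, [1] * 128      # dummy values for illustration

@app.route("/retrieve/features", methods=["POST"])
def features():
    payload = request.get_json(force=True)   # parse the JSON request body
    float_feat, hash_feat = extract_features(payload["image"])
    return jsonify({"float_feature": float_feat, "hash_feature": hash_feat})
```

A real deployment would load the trained model once at startup and decode the request image before inference.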
As shown in fig. 7, based on the same inventive concept, an embodiment of the present invention provides an image retrieval apparatus, which includes an image acquisition module 71, a feature determination module 72, a first filtering module 73, and a second filtering module 74.
The image obtaining module 71 is configured to obtain an initial image set to be retrieved;
the feature determination module 72 is configured to determine, for each image to be retrieved in the initial image set to be retrieved, the floating point feature and the hash feature of the image to be retrieved based on the target image retrieval model;
the first screening module 73 is configured to compare the hash features of the images to be retrieved with reference hash features of reference images, and screen an initial image set to be retrieved to obtain a first image set to be retrieved;
the second filtering module 74 is configured to apply the floating point feature of each first image to be retrieved in the first image set to be retrieved and the reference floating point feature of the reference image for processing, and filter the first image set to be retrieved to obtain the target image set.
In some exemplary embodiments, first screening module 73 is specifically configured to:
determining a Hamming distance between the Hash characteristics of the images to be retrieved and reference Hash characteristics aiming at each image to be retrieved;
screening an initial image set to be retrieved by applying the size relation between the Hamming distance corresponding to each image to be retrieved and a preset Hamming distance threshold;
and determining a set formed by the images to be retrieved corresponding to the Hamming distances smaller than the preset Hamming distance threshold as a first image set to be retrieved.
In some exemplary embodiments, the second filtering module 74 is specifically configured to:
determining a cosine distance between a floating point feature and a reference floating point feature of the first image to be retrieved aiming at each first image to be retrieved in the first image to be retrieved set;
screening the first image set to be retrieved by applying the size relation between the cosine distance corresponding to each first image to be retrieved and a preset cosine distance threshold;
and determining each first image to be retrieved corresponding to each cosine distance greater than a preset cosine distance threshold value to form a target image set.
In some exemplary embodiments, the feature determination module 72 is specifically configured to:
inputting each image to be retrieved in the initial image set to be retrieved into the target image retrieval model, determining the output of the fully connected layer preceding the hash layer in the target image retrieval model as the floating point feature of the image to be retrieved, and determining the output of the hash layer in the target image retrieval model as the hash feature.
In some exemplary embodiments, the method further includes a model training module, configured to perform the following training process to obtain the target image retrieval model:
acquiring a training sample set; wherein the training sample set comprises a plurality of training images;
constructing an initial deep learning network based on the category of each training image in the training sample set; wherein the initial deep learning network model adds a fully connected layer after the Conv5_x layer of a basic ResNet50 network;
setting a first loss function, training the initial deep learning network model based on a training sample set, and determining a first parameter set and a trained initial deep learning network model based on the first parameter set;
adding a hash layer after the Conv5_ x layer of the trained initial deep learning network model to obtain a second deep learning network model; the hash layer is a full connection layer with a preset number of neurons;
setting a second loss function, keeping all parameters in the first parameter set unchanged, training a second deep learning network model based on a training sample set, and determining a second parameter set and a target image retrieval model based on the second parameter set; the second parameter set comprises parameters in the first parameter set and parameters of the hash layer.
In some exemplary embodiments, the first loss function is as follows:
loss = α·softmax loss + β·center loss + γ·triple loss;
wherein loss is the first loss function, softmax loss is the softmax loss function, center loss is the center loss function, triple loss is the triplet loss function, and α, β and γ are the adjustment parameters of the corresponding loss functions.
In some exemplary embodiments, the second loss function is as follows:
F=J+μQ+λC;
wherein F is the second loss function, μ is a trade-off parameter, and λ is a regularization parameter; J is a metric loss function, Q is a loss function determined based on a regularization loss function and a classification loss function, and C is a quantization loss function.
Since the apparatus is the apparatus in the method in the embodiment of the present invention, and the principle of the apparatus for solving the problem is similar to that of the method, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not repeated.
As shown in fig. 8, based on the same inventive concept, an embodiment of the present invention provides an electronic device, including: a processor 801 and an image acquisition unit 802.
the image acquisition unit 802 is configured to perform:
acquiring an initial image set to be retrieved;
the processor 801 is configured to perform:
aiming at each image to be retrieved in the initial image set to be retrieved, determining the floating point characteristic and the Hash characteristic of the image to be retrieved based on a target image retrieval model;
comparing the hash characteristics of each image to be retrieved with the reference hash characteristics of the reference images, and screening the initial image set to be retrieved to obtain a first image set to be retrieved;
and processing by using the floating point characteristics of each first image to be retrieved in the first image set to be retrieved and the reference floating point characteristics of the reference image, and screening the first image set to be retrieved to obtain a target image set.
In some exemplary embodiments, the processor 801 is configured to perform:
determining a Hamming distance between the Hash characteristics of the images to be retrieved and reference Hash characteristics aiming at each image to be retrieved;
screening an initial image set to be retrieved by applying the size relation between the Hamming distance corresponding to each image to be retrieved and a preset Hamming distance threshold;
and determining a set formed by the images to be retrieved corresponding to the Hamming distances smaller than the preset Hamming distance threshold as a first image set to be retrieved.
In some exemplary embodiments, the processor 801 is configured to perform:
determining a cosine distance between a floating point feature and a reference floating point feature of the first image to be retrieved aiming at each first image to be retrieved in the first image to be retrieved set;
screening the first image set to be retrieved by applying the size relation between the cosine distance corresponding to each first image to be retrieved and a preset cosine distance threshold;
and determining each first image to be retrieved corresponding to each cosine distance greater than a preset cosine distance threshold value to form a target image set.
In some exemplary embodiments, the processor 801 is configured to perform:
inputting each image to be retrieved in the initial image set to be retrieved into the target image retrieval model, determining the output of the fully connected layer preceding the hash layer in the target image retrieval model as the floating point feature of the image to be retrieved, and determining the output of the hash layer in the target image retrieval model as the hash feature.
In some exemplary embodiments, the processor 801 is configured to perform the following process to obtain the target image retrieval model:
acquiring a training sample set; wherein the training sample set comprises a plurality of training images;
constructing an initial deep learning network based on the category of each training image in the training sample set; wherein the initial deep learning network model adds a fully connected layer after the Conv5_x layer of a basic ResNet50 network;
setting a first loss function, training the initial deep learning network model based on a training sample set, and determining a first parameter set and a trained initial deep learning network model based on the first parameter set;
adding a hash layer after the Conv5_ x layer of the trained initial deep learning network model to obtain a second deep learning network model; the hash layer is a full connection layer with a preset number of neurons;
setting a second loss function, keeping all parameters in the first parameter set unchanged, training a second deep learning network model based on a training sample set, and determining a second parameter set and a target image retrieval model based on the second parameter set; the second parameter set comprises parameters in the first parameter set and parameters of the hash layer.
In some exemplary embodiments, the first loss function is as follows:
loss = α·softmax loss + β·center loss + γ·triple loss;
wherein loss is the first loss function, softmax loss is the softmax loss function, center loss is the center loss function, triple loss is the triplet loss function, and α, β and γ are the adjustment parameters of the corresponding loss functions.
In some exemplary embodiments, the second loss function is as follows:
F=J+μQ+λC;
wherein F is the second loss function, μ is a trade-off parameter, and λ is a regularization parameter; J is a metric loss function, Q is a loss function determined based on a regularization loss function and a classification loss function, and C is a quantization loss function.
Embodiments of the present invention further provide a computer storage medium, in which computer program instructions are stored, and when the instructions are run on a computer, the computer is caused to execute the steps of the image retrieval method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. An image retrieval method, comprising:
acquiring an initial image set to be retrieved;
aiming at each image to be retrieved in the initial image set to be retrieved, determining the floating point characteristic and the hash characteristic of the image to be retrieved based on a target image retrieval model;
comparing the hash features of the images to be retrieved with the reference hash features of the reference images, and screening the initial image set to be retrieved to obtain a first image set to be retrieved;
and processing the floating point characteristics of each first image to be retrieved in the first image set to be retrieved and the reference floating point characteristics of the reference image, and screening the first image set to be retrieved to obtain a target image set.
2. The method according to claim 1, wherein the applying the hash features of the respective images to be retrieved to compare with the reference hash features of the reference image, and filtering the initial image set to be retrieved to obtain a first image set to be retrieved includes:
determining a Hamming distance between the Hash characteristics of the images to be retrieved and reference Hash characteristics aiming at each image to be retrieved;
screening the initial image set to be retrieved by applying the size relation between the Hamming distance corresponding to each image to be retrieved and a preset Hamming distance threshold;
and determining a set formed by the images to be retrieved corresponding to the Hamming distances smaller than the preset Hamming distance threshold as a first set of images to be retrieved.
3. The method according to claim 1, wherein the applying floating point features of each first image to be retrieved in the first image set to be retrieved and reference floating point features of the reference image to process and filter the first image set to be retrieved to obtain a target image set includes:
for each first image to be retrieved in the first image set to be retrieved, determining a cosine distance between a floating point feature and a reference floating point feature of the first image to be retrieved;
screening the first image set to be retrieved by applying the size relation between the cosine distance corresponding to each first image to be retrieved and a preset cosine distance threshold;
and determining each first image to be retrieved corresponding to each cosine distance greater than the preset cosine distance threshold value to form a target image set.
4. The method of claim 1, wherein determining floating point features and hash features of the image to be retrieved based on a target image retrieval model for each image to be retrieved in the initial set of images to be retrieved comprises:
inputting each image to be retrieved in the initial image set to be retrieved into the target image retrieval model, determining the output of the fully connected layer preceding the hash layer in the target image retrieval model as the floating point feature of the image to be retrieved, and determining the output of the hash layer in the target image retrieval model as the hash feature.
5. The method of claim 1, wherein the target image retrieval model is trained as follows:
acquiring a training sample set; wherein the set of training samples comprises a plurality of training images;
constructing an initial deep learning network based on the category of each training image in the training sample set; wherein the initial deep learning network model adds a fully connected layer after the Conv5_x layer of a basic ResNet50 network;
setting a first loss function, training the initial deep learning network model based on the training sample set, and determining a first parameter set and a trained initial deep learning network model based on the first parameter set;
adding a hash layer behind the Conv5_ x layer of the trained initial deep learning network model to obtain a second deep learning network model; wherein the hash layer is a fully connected layer with a preset number of neurons;
setting a second loss function, keeping all parameters in the first parameter set unchanged, training the second deep learning network model based on the training sample set, and determining a second parameter set and a target image retrieval model based on the second parameter set; wherein the second parameter set comprises parameters in the first parameter set and parameters of the hash layer.
6. The method of claim 5, wherein the first loss function is as follows:
loss = α·softmax_loss + β·center_loss + γ·triplet_loss;
wherein loss is the first loss function, softmax_loss is a softmax loss function, center_loss is a center loss function, triplet_loss is a triplet loss function, and α, β, and γ are the weighting parameters of the corresponding loss functions.
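As a hedged illustration of claim 6's weighted sum: the claim gives neither the per-term definitions nor the values of α, β, γ, so the simple single-sample forms below are assumptions, not the patent's exact formulation:

```python
import math

def softmax_loss(logits, label):
    # Cross-entropy over a softmax for a single sample.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[label] / sum(exps))

def center_loss(feature, center):
    # Half the squared Euclidean distance to the class center.
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge on anchor-positive vs anchor-negative distances.
    d = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

def first_loss(alpha, beta, gamma, sm, ct, tr):
    # loss = alpha*softmax_loss + beta*center_loss + gamma*triplet_loss
    return alpha * sm + beta * ct + gamma * tr
```

The three terms pull in complementary directions: softmax separates classes, center loss compacts each class, and triplet loss enforces a relative margin between matching and non-matching pairs.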
7. The method according to claim 5 or 6, wherein the second loss function is as follows:
F=J+μQ+λC;
wherein F is the second loss function, μ is a balance parameter, and λ is a regularization parameter; J is a metric loss function, Q is a loss function determined based on a regularization loss function and a classification loss function, and C is a quantization loss function.
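Claim 7 leaves J, Q, and C abstract. A typical quantization loss C, assumed here rather than taken from the claim, pushes each hash activation toward ±1 so that sign binarization loses as little information as possible:

```python
def quantization_loss(hash_output):
    # Penalize hash activations far from the binary targets +1/-1;
    # the loss is zero when every activation is exactly ±1.
    return sum((abs(v) - 1.0) ** 2 for v in hash_output)

def second_loss(j, q, c, mu, lam):
    # F = J + mu*Q + lam*C (claim 7's weighted combination).
    return j + mu * q + lam * c
```

With all activations saturated at ±1, the C term vanishes and F reduces to the metric and classification terms alone.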
8. An electronic device comprising a processor and an image acquisition unit, wherein:
the image acquisition unit is configured to perform:
acquiring an initial image set to be retrieved;
the processor is configured to perform:
for each image to be retrieved in the initial image set to be retrieved, determining the floating point feature and the hash feature of the image to be retrieved based on a target image retrieval model;
comparing the hash feature of each image to be retrieved with the reference hash feature of the reference image, and screening the initial image set to be retrieved to obtain a first image set to be retrieved;
and processing the floating point feature of each first image to be retrieved in the first image set to be retrieved together with the reference floating point feature of the reference image, and screening the first image set to be retrieved to obtain a target image set.
9. The device of claim 8, wherein the processor is configured to perform:
for each image to be retrieved, determining the Hamming distance between the hash feature of the image to be retrieved and the reference hash feature;
screening the initial image set to be retrieved according to the magnitude relationship between the Hamming distance corresponding to each image to be retrieved and a preset Hamming distance threshold;
and determining the set formed by the images to be retrieved whose Hamming distances are smaller than the preset Hamming distance threshold as the first image set to be retrieved.
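The screening in claim 9 can be sketched with integer hash codes; the code width and the threshold value below are assumptions, since the claim fixes neither:

```python
def hamming_distance(code_a, code_b):
    # Number of differing bits between two integer hash codes,
    # computed via XOR followed by a popcount.
    return bin(code_a ^ code_b).count("1")

def screen_by_hamming(candidates, reference_code, threshold):
    # Keep the images whose Hamming distance to the reference code
    # is smaller than the preset threshold (claim 9's test).
    return [img for img, code in candidates
            if hamming_distance(code, reference_code) < threshold]
```

Because the comparison is a single XOR plus popcount per candidate, this coarse hash pass is far cheaper than the cosine comparison of floating point features, which is why the claims run it first.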
10. The device of claim 8, wherein the processor is configured to perform:
for each first image to be retrieved in the first image set to be retrieved, determining the cosine distance between the floating point feature of the first image to be retrieved and the reference floating point feature;
screening the first image set to be retrieved according to the magnitude relationship between the cosine distance corresponding to each first image to be retrieved and a preset cosine distance threshold;
and determining that the first images to be retrieved whose corresponding cosine distances are greater than the preset cosine distance threshold form the target image set.
CN202210163348.7A 2022-02-22 2022-02-22 Image retrieval method and device Pending CN114547365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210163348.7A CN114547365A (en) 2022-02-22 2022-02-22 Image retrieval method and device

Publications (1)

Publication Number Publication Date
CN114547365A true CN114547365A (en) 2022-05-27

Family

ID=81678371

Country Status (1)

Country Link
CN (1) CN114547365A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863138A (en) * 2022-07-08 2022-08-05 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, storage medium, and device
CN114863138B (en) * 2022-07-08 2022-09-06 腾讯科技(深圳)有限公司 Image processing method, device, storage medium and equipment
CN115858846A (en) * 2023-02-16 2023-03-28 云南派动科技有限公司 Deep learning-based skier image retrieval method and system
CN115858846B (en) * 2023-02-16 2023-04-21 云南派动科技有限公司 Skier image retrieval method and system based on deep learning

Similar Documents

Publication Publication Date Title
CN109978893B (en) Training method, device, equipment and storage medium of image semantic segmentation network
WO2021164625A1 (en) Method of training an image classification model
Van Der Maaten Barnes-hut-sne
CN108108751B (en) Scene recognition method based on convolution multi-feature and deep random forest
CN113360701B (en) Sketch processing method and system based on knowledge distillation
CN111898578B (en) Crowd density acquisition method and device and electronic equipment
CN110390340B (en) Feature coding model, training method and detection method of visual relation detection model
CN109919252B (en) Method for generating classifier by using few labeled images
CN114547365A (en) Image retrieval method and device
CN107451562B (en) Wave band selection method based on chaotic binary gravity search algorithm
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN110717554A (en) Image recognition method, electronic device, and storage medium
CN107194437B (en) Image classification method based on Gist feature extraction and concept machine recurrent neural network
CN111310821A (en) Multi-view feature fusion method, system, computer device and storage medium
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN114627424A (en) Gait recognition method and system based on visual angle transformation
CN115705706A (en) Video processing method, video processing device, computer equipment and storage medium
Lim et al. More powerful selective kernel tests for feature selection
CN111160077A (en) Large-scale dynamic face clustering method
CN114943290A (en) Biological invasion identification method based on multi-source data fusion analysis
CN114818900A (en) Semi-supervised feature extraction method and user credit risk assessment method
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
CN112149566A (en) Image processing method and device, electronic equipment and storage medium
CN108073934A (en) Nearly multiimage detection method and device
CN112507912A (en) Method and device for identifying illegal picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination