CN113127672A - Generation method, retrieval method, medium and terminal of quantized image retrieval model - Google Patents


Info

Publication number
CN113127672A
CN113127672A (application CN202110432335.0A)
Authority
CN
China
Prior art keywords
vector
image
preset
text
training
Prior art date
Legal status
Granted
Application number
CN202110432335.0A
Other languages
Chinese (zh)
Other versions
CN113127672B (en)
Inventor
陈斌
王锦鹏
夏树涛
戴涛
李清
Current Assignee
Shenzhen International Graduate School of Tsinghua University
Peng Cheng Laboratory
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University and Peng Cheng Laboratory
Priority to CN202110432335.0A
Publication of CN113127672A
Application granted
Publication of CN113127672B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a generation method, a retrieval method, a medium and a terminal for a quantized image retrieval model, wherein the generation method comprises the following steps: determining a predictive quantization vector corresponding to a training image in a preset sample set by using a preset network model; determining a text vector corresponding to the training image based on the text label of the training image; and training the preset network model based on the text vector and the predictive quantization vector to obtain a quantized image retrieval model. The text labels corresponding to the training images serve as weak supervision labels, and the preset network model is trained on these labels together with the predictive quantization vectors. Deep quantization can therefore be learned from weakly labeled image data, which overcomes the dependence of existing deep quantization on accurately labeled data and reduces both the labeling cost and the overall training cost of the quantized image retrieval model.

Description

Generation method, retrieval method, medium and terminal of quantized image retrieval model
Technical Field
The present application relates to the field of image retrieval technologies, and in particular, to a method for generating a quantized image retrieval model, a retrieval method, a medium, and a terminal.
Background
Currently, quantization techniques based on deep learning (for example, deep quantization using a convolutional neural network (CNN)) are widely applied to large-scale image retrieval and achieve higher retrieval accuracy than conventional quantization coding based on handcrafted features. However, existing deep quantization models are generally trained on image datasets with accurate manual labeling (e.g., the CIFAR-10 and ImageNet image datasets), which requires substantial human resources for data labeling before the models are trained, thereby increasing the training cost of the quantization models.
Thus, the prior art has yet to be improved and enhanced.
Disclosure of Invention
The technical problem to be solved by the present application is to provide, in view of the deficiencies of the prior art, a method for generating a quantized image retrieval model, a retrieval method, a medium, and a terminal.
In order to solve the above technical problem, a first aspect of the embodiments of the present application provides a method for generating a quantized image retrieval model, where the method includes:
determining a predictive quantization vector corresponding to a training image in a preset sample set by using a preset network model;
determining a text vector corresponding to the training image based on the text label of the training image;
and training the preset network model based on the text vector and the prediction quantization vector to obtain a quantization image retrieval model.
The method for generating the quantized image retrieval model, wherein the preset sample set comprises a plurality of training image groups, and each training image group in the plurality of training image groups comprises a training image and a text label corresponding to the training image.
The method for generating the quantized image retrieval model, wherein the preset network model comprises a feature extraction module and an attention module; the determining, by using the preset network model, the predictive quantization vector corresponding to the training image in the preset sample set specifically includes:
inputting the training images in the preset sample set into the feature extraction module, and determining feature vectors corresponding to the training images through the feature extraction module;
and inputting the feature vector into the attention module, and determining a predictive quantization vector corresponding to the training image through the attention module.
The method for generating the quantized image retrieval model, wherein the preset network model is configured with a plurality of preset codebooks; the inputting the feature vector into the attention module, and determining the predictive quantization vector corresponding to the training image by the attention module specifically includes:
dividing the feature vector into a plurality of vector segments, wherein the vector segments correspond to the plurality of preset codebooks one to one;
determining the quantized vector segment corresponding to each vector segment based on the preset codebook corresponding to that vector segment;
and determining the predictive quantization vector corresponding to the training image based on the quantized vector segment corresponding to each vector segment.
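The divide-quantize-concatenate procedure above can be sketched as follows. This is a minimal numpy illustration under assumptions: the hard nearest-codeword assignment shown here stands in for the attention-based soft assignment the patent describes, and the segment count, codebook sizes, and dimensions are arbitrary.

```python
import numpy as np

def quantize_feature(feature, codebooks):
    """Split a feature vector into equal non-overlapping segments, quantize
    each segment against its own codebook, and concatenate the results."""
    d = len(codebooks)                      # number of segments == number of codebooks
    segments = np.split(feature, d)         # equal-length, non-overlapping segments
    quantized = []
    for seg, codebook in zip(segments, codebooks):
        # hard assignment for clarity; the patent's attention module instead
        # forms a soft weighted combination of codewords
        idx = np.argmin(np.linalg.norm(codebook - seg, axis=1))
        quantized.append(codebook[idx])
    return np.concatenate(quantized)

rng = np.random.default_rng(0)
feature = rng.normal(size=8)                              # D = 8
codebooks = [rng.normal(size=(4, 4)) for _ in range(2)]   # d = 2 codebooks, M = 4
q = quantize_feature(feature, codebooks)
```

Each quantized segment is one row of its codebook, so the concatenated result has the same dimension as the input feature vector.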
The method for generating the quantized image retrieval model, wherein the determining the quantized vector segments corresponding to the vector segments based on the preset codebooks corresponding to the vector segments specifically includes:
for each vector segment in the plurality of vector segments, respectively determining the attention weight between the vector segment and each preset codeword in the preset codebook corresponding to that vector segment;
and determining the quantized vector segment corresponding to the vector segment based on each preset codeword and the attention weight corresponding to each preset codeword, so as to obtain the quantized vector segments corresponding to all the vector segments.
The method for generating the quantized image retrieval model, wherein the determining, for each vector segment in the plurality of vector segments, the attention weight between the vector segment and each preset codeword in the corresponding preset codebook specifically includes:
for each vector segment in the plurality of vector segments, respectively calculating a first attention weight between the vector segment and each preset codeword in the corresponding preset codebook, and calculating the sum of all the first attention weights;
and for each preset codeword in the preset codebook, calculating the ratio of the first attention weight corresponding to that preset codeword to the sum, and taking the ratio as the attention weight corresponding to the preset codeword.
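Assuming the first attention weight is an exponentiated inner-product score between the vector segment and a codeword (the claim does not fix the score function, so this is an illustrative choice), the ratio computation above amounts to a softmax normalization:

```python
import numpy as np

def attention_weights(segment, codebook):
    """Normalized attention weight of each codeword for one vector segment:
    first weight = raw score, final weight = score / sum of all scores."""
    scores = np.exp(codebook @ segment)     # first attention weights (assumed form)
    return scores / scores.sum()            # ratios, guaranteed to sum to 1

rng = np.random.default_rng(1)
seg = rng.normal(size=4)                    # one vector segment, M = 4
cb = rng.normal(size=(8, 4))                # codebook with 8 codewords
w = attention_weights(seg, cb)
```

The resulting weights are positive and sum to one, so the weighted sum of codewords they induce is a convex combination, i.e., a soft quantization of the segment.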
The method for generating the quantized image retrieval model, wherein the text label comprises a plurality of text labels; the determining the text vector corresponding to the training image based on the text label of the training image specifically includes:
inputting each text label of the plurality of text labels into a word embedding model, and determining the candidate text vector corresponding to each text label through the word embedding model;
and determining a text vector corresponding to the training image based on the candidate text vector corresponding to each text label.
The method for generating the quantized image retrieval model, wherein the vector dimensions of the candidate text vectors corresponding to the text labels are the same; the determining, based on the candidate text vectors corresponding to the text labels, the text vector corresponding to the training image specifically includes:
and calculating the average text vector of the candidate text vectors corresponding to the text labels, and taking the average text vector as the text vector corresponding to the training image.
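A minimal sketch of the averaging step, using a toy embedding table in place of a trained word embedding model (the labels, dimensions, and embedding values are illustrative):

```python
import numpy as np

def text_vector(labels, embedding):
    """Embed each text label and average the candidate vectors to obtain
    the text vector of the training image."""
    vectors = [embedding[label] for label in labels]   # candidate text vectors
    return np.mean(vectors, axis=0)                    # same dimension as each candidate

# toy embedding table standing in for a trained word-embedding model
embedding = {"nature": np.array([1.0, 0.0]),
             "landscape": np.array([0.0, 1.0])}
t = text_vector(["nature", "landscape"], embedding)
```

Averaging requires that all candidate text vectors share the same dimension, which is exactly the condition stated in the claim above.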
The method for generating the quantized image retrieval model, wherein the training of the preset network model based on the text vector and the predictive quantization vector to obtain the quantized image retrieval model specifically includes:
determining a loss function value corresponding to the training image according to the text vector and the predictive quantization vector;
and training the model parameters of the preset network model and the plurality of preset codebooks configured for it based on the loss function value, so as to obtain the quantized image retrieval model and a plurality of codebooks.
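The claims do not fix the form of the loss function. One plausible choice consistent with aligning the text vector and the predictive quantization vector is a cosine-similarity loss; the sketch below is illustrative only and assumes both vectors have been projected to the same dimension.

```python
import numpy as np

def alignment_loss(text_vec, quant_vec):
    """Illustrative loss: 1 - cosine similarity between the text vector
    (weak supervision signal) and the predictive quantization vector."""
    cos = float(text_vec @ quant_vec /
                (np.linalg.norm(text_vec) * np.linalg.norm(quant_vec)))
    return 1.0 - cos          # 0 when perfectly aligned, 2 when opposed

v = np.array([1.0, 2.0, 3.0])
loss_aligned = alignment_loss(v, 2 * v)     # same direction
loss_opposed = alignment_loss(v, -v)        # opposite direction
```

Minimizing such a loss over the model parameters and codebooks would push the predicted quantization vectors toward the semantics carried by the weak text labels.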
A second aspect of the embodiments of the present application provides an image retrieval method, which applies a quantized image retrieval model determined by the method for generating a quantized image retrieval model as described in any one of the above, where the image retrieval method includes:
inputting a query image into the quantized image retrieval model, and determining a query vector corresponding to the query image through the quantized image retrieval model;
determining similarity between the query vector and each code word in each of a plurality of codebooks;
and retrieving a target image corresponding to the query image in a preset image database based on the determined similarity.
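The codeword similarities can be computed once per query and reused as lookup tables when scoring every database image, in the spirit of asymmetric quantized search. This is a sketch under an assumed inner-product similarity; the claims do not specify the similarity measure or the code layout.

```python
import numpy as np

def search(query_vec, codes, codebooks, top_k=1):
    """Asymmetric search: compute the similarity between each query segment
    and every codeword once, then score each database image by table lookups
    on its stored codeword indices."""
    d = len(codebooks)
    q_segs = np.split(query_vec, d)
    # per-segment similarity of the query to every codeword in that codebook
    tables = [cb @ seg for cb, seg in zip(codebooks, q_segs)]
    scores = np.array([sum(tables[m][code[m]] for m in range(d))
                       for code in codes])
    return np.argsort(-scores)[:top_k]      # indices of the most similar images

codebooks = [np.eye(2), np.eye(2)]          # 2 codebooks, 2 codewords each
codes = np.array([[0, 1], [1, 0]])          # stored codeword indices per image
best = search(np.array([1.0, 0.0, 0.0, 1.0]), codes, codebooks)
```

Because each database image is stored only as codeword indices, scoring it costs d table lookups rather than a full vector comparison.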
Before the query image is input into the quantized image retrieval model and the query vector corresponding to the query image is determined by the quantized image retrieval model, the image retrieval method further includes:
and respectively inputting each image in a preset image database into the quantized image retrieval model, and determining a quantized vector corresponding to each image through the quantized image retrieval model.
The image retrieval method, wherein retrieving, in a preset image database, a target image corresponding to the query image based on the determined similarity specifically includes:
determining candidate similarity of quantization vectors corresponding to the query image and each image in a preset image database based on the determined similarity;
searching a target image corresponding to the query image in a preset image database based on the determined candidate similarity;
if the target image is found, judging that the preset image database contains the query image;
and if the target image is not found, judging that the preset image database does not contain the query image.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps in the method for generating a quantized image retrieval model as described in any one of the above, and/or the steps in the image retrieval method as described in any one of the above.
A fourth aspect of the embodiments of the present application provides a terminal device, which includes: a processor, a memory, and a communication bus; the memory stores a computer-readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the method for generating a quantized image retrieval model as described in any one of the above, and/or implements the steps in the method for image retrieval as described in any one of the above.
Advantageous effects: compared with the prior art, the application provides a generation method, a retrieval method, a medium and a terminal for a quantized image retrieval model, wherein the generation method comprises the following steps: determining a predictive quantization vector corresponding to a training image in a preset sample set by using a preset network model; determining a text vector corresponding to the training image based on the text label of the training image; and training the preset network model based on the text vector and the predictive quantization vector to obtain a quantized image retrieval model. The text labels corresponding to the training images serve as weak supervision labels, and the preset network model is trained on these labels together with the predictive quantization vectors. Deep quantization can therefore be learned from weakly labeled image data, which overcomes the dependence of existing deep quantization on accurately labeled data and reduces both the labeling cost and the overall training cost of the quantized image retrieval model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without any inventive work.
Fig. 1 is a flowchart of a method for generating a quantized image retrieval model according to the present application.
Fig. 2 is a working schematic diagram of a method for generating a quantized image retrieval model according to the present application.
Fig. 3 is a schematic structural diagram of a terminal device provided in the present application.
Detailed Description
The present application provides a generation method, a retrieval method, a medium, and a terminal of a quantized image retrieval model, and in order to make the purpose, technical solution, and effect of the present application clearer and clearer, the present application will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In particular implementations, the terminal devices described in the embodiments of the present application include, but are not limited to, portable devices such as mobile phones, laptops, or tablet computers with touch-sensitive surfaces (e.g., touch displays and/or touch pads). It should also be understood that in some embodiments, the device is not a portable communication device but a desktop computer having a touch-sensitive surface (e.g., a touch-sensitive display screen and/or touchpad).
In the discussion that follows, a terminal device that includes a display and a touch-sensitive surface is described. However, it should be understood that the terminal device may also include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.
The terminal device supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a video conferencing application, a disc burning application, a spreadsheet application, a gaming application, a telephone application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a digital video camera application, a web browsing application, a digital music player application, and/or a digital video player application.
Various applications that may be executed on the terminal device may use at least one common physical user interface device, such as a touch-sensitive surface. One or more functions of the touch-sensitive surface and the corresponding information displayed on the terminal may be adjusted and/or changed between applications and/or within respective applications. In this way, a common physical framework (e.g., touch-sensitive surface) of the terminal can support various applications with user interfaces that are intuitive and transparent to the user.
It should be understood that, the sequence numbers and sizes of the steps in this embodiment do not mean the execution sequence, and the execution sequence of each process is determined by its function and inherent logic, and should not constitute any limitation on the implementation process of this embodiment.
The inventors have studied and found that quantization techniques based on deep learning (for example, deep quantization using a convolutional neural network (CNN)) are widely used in large-scale image retrieval and achieve higher retrieval accuracy than conventional quantization coding based on handcrafted features. However, existing deep quantization models are generally trained on image datasets with accurate manual labeling (e.g., the CIFAR-10 and ImageNet image datasets), which requires substantial human resources for data labeling before the models are trained, thereby increasing the training cost of the quantization models.
However, in practical applications, image data with weak annotations is ubiquitous. For example, in a social media application, a user may attach a piece of comment text and select a topic tag when uploading an image, so that the image carries two weak supervision annotations, namely the comment text and the topic tag. Although the text information carried by a picture does not necessarily reflect its content accurately, it can serve as a weak supervision signal containing the visual semantic information of the picture.
Based on this, in the embodiment of the application, a predictive quantization vector corresponding to a training image in a preset sample set is determined by using a preset network model; a text vector corresponding to the training image is determined based on the text label of the training image; and the preset network model is trained based on the text vector and the predictive quantization vector to obtain a quantized image retrieval model. The text labels corresponding to the training images serve as weak supervision labels, and the preset network model is trained on these labels together with the predictive quantization vectors. Deep quantization can therefore be learned from weakly labeled image data, which overcomes the dependence of existing deep quantization on accurately labeled data and reduces both the labeling cost and the overall training cost of the quantized image retrieval model.
The following further describes the content of the application by describing the embodiments with reference to the attached drawings.
The present embodiment provides a method for generating a quantized image retrieval model, as shown in fig. 1 and 2, the method including:
and S10, determining the predictive quantization vector corresponding to the training image in the preset sample set by using the preset network model.
Specifically, the preset sample set is prepared in advance and used for training the preset network model to obtain the quantized image retrieval model. The preset sample set comprises a plurality of training image groups, each training image group comprises a training image and a text label, and the text label serves as a weak supervision label of the training image. The training image may correspond to one text label or to a plurality of text labels; when it corresponds to a plurality of text labels, all of them are used as weak supervision labels of the training image. For example, for a training image that is a landscape photograph of a valley, the corresponding text labels may be "nature", "spectacular", and "landscape".
In an implementation manner of this embodiment, since users of social media generally attach comments and/or hashtags when uploading images, the preset sample set may be obtained as follows: obtain images uploaded by users on social media, extract the comments and/or topic labels carried by each uploaded image, take the extracted text information as the text label corresponding to that image, and finally take each image together with its text label as a training image group to form the preset sample set. Of course, in practical applications, the training images in the preset sample set may also be determined in other manners, for example, by shooting training images through an imaging module and configuring text labels for the shot images. Because this implementation uses images uploaded by social media users as training images, and the comments and/or topic labels they carry as text labels, the preset sample set can be collected quickly, which in turn speeds up the training of the quantized image retrieval model.
As shown in fig. 2, the preset network model includes a feature extraction module and an attention module; the determining, by using the preset network model, the predictive quantization vector corresponding to the training image in the preset sample set specifically includes:
inputting the training images in the preset sample set into the feature extraction module, and determining feature vectors corresponding to the training images through the feature extraction module;
and inputting the feature vector into the attention module, and determining a predictive quantization vector corresponding to the training image through the attention module.
Specifically, the feature extraction module is configured to extract a feature vector corresponding to a training image, where the feature extraction module may include a feature extraction unit and a conversion unit, an input item of the feature extraction unit is the training image, an output item of the feature extraction unit is a feature map corresponding to the training image, an input item of the conversion unit is a feature map, and an output item of the conversion unit is the feature vector corresponding to the training image. As can be understood, the training image is input into the feature extraction unit, and the feature extraction unit outputs the feature map corresponding to the training image; and inputting the feature map into a conversion unit, and outputting the feature vector corresponding to the training image through the conversion unit.
In an implementation manner of this embodiment, the feature extraction unit may employ a convolutional neural network model, where the convolutional neural network model may include an input layer, a plurality of convolutional layers, and a plurality of fully-connected layers that are sequentially cascaded; the input item of the input layer is the training image, and the output item of the last fully-connected layer is the feature map. In one specific implementation, there may be 4 convolutional layers and 2 fully-connected layers. Of course, in practical applications, the number of convolutional layers may be determined according to practical requirements, for example, 5 convolutional layers. Furthermore, the conversion unit is configured to flatten the feature map into the feature vector, where the vector dimension of the feature vector equals the product of the dimensions of the feature map. For example, if the image scale of the feature map is 40 × 40 × 3, then the vector dimension of the feature vector is 40 × 40 × 3 = 4800.
In an implementation manner of this embodiment, the preset network model is configured with a plurality of preset codebooks. Each preset codebook comprises a plurality of codewords, the codewords within a codebook are different from one another, and each codeword can serve as a quantization code for the training image. The number of codewords in each preset codebook may be the same or different across codebooks, and can be determined based on actual requirements.
Based on this, the inputting the feature vector into the attention module, and the determining, by the attention module, the predictive quantization vector corresponding to the training image specifically includes:
dividing the feature vector into a plurality of vector segments;
determining the quantized vector segment corresponding to each vector segment based on the preset codebook corresponding to that vector segment;
and determining the predictive quantization vector corresponding to the training image based on the quantized vector segment corresponding to each vector segment.
Specifically, the vector segments do not overlap, and together they constitute the feature vector; the number of vector segments equals the number of preset codebooks, and the vector segments correspond to the preset codebooks one to one. It can be understood that, when dividing the feature vector into vector segments, the number of preset codebooks may be obtained first and the feature vector divided accordingly. In a specific implementation, the vector dimensions of the vector segments are the same: for example, if the vector dimension of the feature vector is D, the number of vector segments is d, and the vector dimension of each vector segment is M, then D = d × M.
In an implementation manner of this embodiment, the plurality of preset codebooks may be assigned a codebook sequence in advance. After the vector segments divided from the feature vector are sorted according to their order within the feature vector, the resulting vector segment sequence corresponds to the codebook sequence formed by the preset codebooks, where corresponding means that the position of a vector segment in the vector segment sequence is the same as the position of its preset codebook in the codebook sequence. For example, suppose the vector segments comprise a vector segment A and a vector segment B, and the preset codebooks comprise a preset codebook a and a preset codebook b; if the vector segment sequence is <vector segment A, vector segment B> and the codebook sequence is <preset codebook a, preset codebook b>, then the preset codebook corresponding to vector segment A is preset codebook a, and the preset codebook corresponding to vector segment B is preset codebook b.
The quantized vector segment is a representation of the vector segment built from its preset codebook; the vector segment is quantized through its quantized vector segment, so that the feature vector can be quantized through the quantized vector segments corresponding to the vector segments, and the training image corresponding to the feature vector is thereby quantized. In a specific implementation manner of this embodiment, the determining, based on the preset codebook corresponding to each vector segment, the quantized vector segment corresponding to each vector segment specifically includes:
for each vector segment in the plurality of vector segments, respectively determining each preset code word in a preset codebook corresponding to the vector segment and the attention weight of the vector segment;
and determining the quantized vector segment corresponding to the vector segment based on each preset code word and the attention weight corresponding to each preset code word so as to obtain the quantized vector segment corresponding to each vector segment.
Specifically, the attention weight is a weight reflecting a corresponding preset code word in the quantized vector segment, and is used for reflecting the importance degree of the preset code word in the quantized vector segment; the greater the attention weight is, the higher the importance degree of the preset code word corresponding to the attention weight is, and conversely, the smaller the attention weight is, the lower the importance degree of the preset code word corresponding to the attention weight is. The attention weight corresponding to each preset codeword may be preset, or may be calculated based on the preset codeword and the vector segment by using an attention mechanism.
In an implementation manner of this embodiment, for each of the plurality of vector segments, respectively determining each preset codeword in the preset codebook corresponding to the vector segment and the attention weight of the vector segment specifically includes:
for each vector segment in the plurality of vector segments, respectively calculating each preset code word in a preset codebook and a first attention weight of the vector segment, and calculating the sum of all the first attention weights;
and for each preset code word in the preset codebook, calculating the ratio of the first attention weight corresponding to the preset code word to the sum value, and taking the ratio as the attention weight corresponding to the preset code word.
Specifically, the preset codebook is the preset codebook corresponding to the vector segment, and each preset code word in the preset codebook corresponds to a first attention weight, where the first attention weight may be determined based on the cosine similarity between the vector segment and the preset code word, or based on the inner product between the vector segment and the preset code word, and so on. In a specific implementation manner of this embodiment, the calculation formula of the first attention weight may be:

$$ e_{m,k} = \exp\left( v_m^T c_k^m \right) $$

where $v_m$ is the m-th vector segment, $v_m^T$ is the transposed vector of the m-th vector segment, and $c_k^m$ is the k-th preset code word in the m-th preset codebook.
Further, after the first attention weights are obtained, the calculation formula of the attention weight corresponding to the preset code word may be:

$$ a_{m,k} = \frac{\exp\left( v_m^T c_k^m \right)}{\sum_{k'=1}^{K} \exp\left( v_m^T c_{k'}^m \right)} $$

where $v_m$ is the m-th vector segment, $v_m^T$ is the transposed vector of the m-th vector segment, $c_k^m$ is the k-th preset code word in the m-th preset codebook, and $K$ is the number of preset code words in the m-th preset codebook.
After the attention weights corresponding to the preset codewords in the preset codebook are obtained, the quantization vector segments corresponding to the vector segments may be determined based on the attention weights corresponding to the preset codewords, where the quantization vector segments may be preset codewords with the largest attention weight in the preset codewords, or obtained based on the preset codewords in the preset codebook and the attention weights corresponding to the preset codewords.
In an implementation manner of this embodiment, the quantized vector segment corresponding to the vector segment is obtained by weighting each preset code word in the preset codebook corresponding to the vector segment, and the calculation formula of the quantized vector segment may be:

$$ \hat{v}_m = \sum_{k=1}^{K} a_{m,k} \, c_k^m $$

where $K$ is the number of preset code words in the preset codebook, $\hat{v}_m$ is the quantized vector segment corresponding to the m-th vector segment, $a_{m,k}$ is the attention weight corresponding to the k-th preset code word, and $c_k^m$ is the k-th preset code word.
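The attention-weighted quantization of one segment can be sketched as follows; this is a minimal NumPy sketch under the notation above, with illustrative sizes and random data standing in for a learned codebook.

```python
import numpy as np

d, K = 8, 16                       # segment dimension and codebook size (illustrative)
rng = np.random.default_rng(0)
v_m = rng.standard_normal(d)       # the m-th vector segment
C_m = rng.standard_normal((K, d))  # the m-th preset codebook: K preset code words

# First attention weights: exp of the inner product between the segment
# and each preset code word.
e = np.exp(C_m @ v_m)              # shape (K,)

# Attention weights: each first weight divided by the sum of all first
# weights, i.e. a softmax over the K code words.
a = e / e.sum()

# Quantized vector segment: attention-weighted sum of the code words.
v_m_hat = a @ C_m                  # shape (d,)
```

Because the weighted sum is differentiable in both the segment and the code words, the codebooks can be trained end to end together with the network parameters, which is the point of the attention-based quantization.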
After the quantized vector segments corresponding to the vector segments are obtained, the quantized vector segments are connected according to the positions of their corresponding vector segments in the feature vector, so that the predictive quantization vector corresponding to the feature vector is obtained. For example, if the vector segments of the feature vector are $v_1, v_2, \ldots, v_N$ and the quantized vector segment corresponding to each vector segment $v_m$ is $\hat{v}_m$, then the predictive quantization vector corresponding to the feature vector $(v_1, v_2, \ldots, v_N)$ is $\hat{r} = (\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_N)$.
S20, determining a text vector corresponding to the training image based on the text label of the training image.
Specifically, the text vector is a word vector corresponding to the text label, and it can be understood that the text vector corresponding to the text label can be determined by a word vector model, for example, inputting the text label into the word vector model, outputting a word vector corresponding to the text label by the word vector model, and taking the word vector as a text vector corresponding to the training image. In addition, the training image may correspond to one text label or correspond to a plurality of text labels, and when the training image corresponds to one text label, the word vector corresponding to the text label is the text vector corresponding to the training image; when the training image corresponds to a plurality of text labels, the text vector corresponding to the training image may be determined based on the word vector corresponding to each text label in the plurality of text labels.
In one implementation manner of this embodiment, the text labels include a plurality of text labels; the determining the text vector corresponding to the text label of the training image specifically includes:
inputting each text label in the plurality of text labels into a word embedding model, and determining the candidate text vector corresponding to each text label through the word embedding model;
and determining a text vector corresponding to the training image based on the candidate text vector corresponding to each text label.
Specifically, the word embedding model is trained in advance; when a text label is input into the word embedding model, the word embedding model may output the candidate text vector corresponding to that text label, so each of the text labels is input into the word embedding model, and the candidate text vector corresponding to each text label is determined by the word embedding model. In addition, among the obtained candidate text vectors corresponding to the text labels, the average value of the candidate text vectors may be used as the text vector corresponding to the training image, or the candidate text vectors may be weighted to obtain the text vector corresponding to the training image, or one candidate text vector may be randomly selected from the plurality of candidate text vectors as the text vector corresponding to the training image, and so on. In an implementation manner of this embodiment, the vector dimensions of the candidate text vectors corresponding to the text labels are the same; the determining, based on the candidate text vectors corresponding to the text labels, the text vector corresponding to the training image specifically includes: calculating the average text vector of the candidate text vectors corresponding to the text labels, and taking the average text vector as the text vector corresponding to the training image.
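The averaging variant can be sketched as follows; the embedding table here is a hypothetical stand-in for a trained word embedding model, and the labels and vector values are invented for the example.

```python
import numpy as np

# Hypothetical stand-in for a trained word embedding model: it maps each
# text label to a candidate text vector of a fixed dimension.
embedding = {
    "cat":    np.array([0.2, 0.8, 0.1]),
    "animal": np.array([0.4, 0.6, 0.3]),
    "pet":    np.array([0.3, 0.7, 0.2]),
}

labels = ["cat", "animal", "pet"]      # the text labels of one training image
candidates = np.stack([embedding[w] for w in labels])

# Text vector of the training image: the average of the candidate text
# vectors, which damps the influence of any single noisy label.
text_vector = candidates.mean(axis=0)
```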
S30, training the preset network model based on the text vector and the prediction quantization vector to obtain a quantization image retrieval model.
Specifically, the quantized image retrieval model is obtained by training the preset network model, and its model structure is the same as that of the preset network model. The difference between the two is that the model parameters of the preset network model are initial model parameters and the plurality of preset codebooks configured for the preset network model are preset, whereas the model parameters of the quantized image retrieval model are trained model parameters and the quantized image retrieval model is configured with a plurality of codebooks that are determined in the process of training the preset network model based on the preset sample set. It can be understood that, when the preset network model is trained based on the text vectors and the predictive quantization vectors, the model parameters of the preset network model and the plurality of preset codebooks are trained, and when the quantized image retrieval model is obtained by training, the trained codebooks are obtained as well.
Based on this, the training the preset network model based on the text vector and the predicted quantization vector to obtain a quantized image retrieval model specifically includes:
determining a loss function value corresponding to the training image according to the text vector and the prediction quantization vector;
and training the model parameters of the preset network model and a plurality of preset codebooks configured by the model parameters based on the loss function values to obtain a quantized image retrieval model and a plurality of codebooks.
Specifically, the loss function value is determined based on the text vectors and the predictive quantization vectors. In the training process, the preset sample set may be divided into a plurality of training batches; when one of the training batches is used to train the preset network model, the loss function value is determined based on the training images included in that batch, and the preset network model is trained accordingly. Of course, each training image may also be used as a training batch, and after the preset network model is trained based on that training image, the loss function value corresponding to the training image is determined.
In an implementation manner of this embodiment, the calculation formula of the loss function value may be:

$$ L = -\frac{1}{B} \sum_{k=1}^{B} \log \frac{\exp\left( \hat{r}_k^T t_k \right)}{\sum_{j=1}^{B} \exp\left( \hat{r}_k^T t_j \right)} $$

where $L$ is the loss function value, $B$ is the size of the training batch, $\hat{r}_k$ is the predictive quantization vector of the k-th training image, $\hat{r}_k^T$ is the transposed vector of the predictive quantization vector $\hat{r}_k$, $t_k$ is the text vector corresponding to the k-th training image, and $t_j$ is the text vector corresponding to the j-th training image.
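A batched computation of this contrastive loss can be sketched as follows; this is a minimal NumPy sketch with random vectors standing in for the model outputs, and the batch size and dimension are illustrative.

```python
import numpy as np

B, dim = 4, 8                          # batch size and vector dimension (illustrative)
rng = np.random.default_rng(1)
r_hat = rng.standard_normal((B, dim))  # predictive quantization vectors
t = rng.standard_normal((B, dim))      # text vectors of the same images

# Pairwise scores: entry (k, j) is r_hat_k^T t_j.
scores = r_hat @ t.T                   # shape (B, B)

# Contrastive loss: for each image k, its quantized representation is
# pulled toward its own text vector t_k relative to the other text
# vectors in the batch (log-softmax over each row, diagonal as target).
log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_prob))
```

Minimizing this loss matches each quantized picture representation with its corresponding text representation, as described in the summary below.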
In summary, the present embodiment provides a method for generating a quantized image retrieval model, where the method includes: determining a predictive quantization vector corresponding to a training image in a preset sample set by using a preset network model; determining a text vector corresponding to the training image based on the text label of the training image; and training the preset network model based on the text vector and the prediction quantization vector to obtain a quantization image retrieval model. According to the method and the device, the text labels corresponding to the training images are used as weak supervision labels, the preset network model is trained through the weak supervision labels and the prediction quantization vectors, so that the depth quantization can be learned by using weak label picture data, the problem that the existing depth quantization depends on data with high-quality labels is solved, the labor cost of the quantization image retrieval model can be reduced, and the training cost of the quantization image retrieval model is reduced. 
In addition, firstly, this embodiment uses a text vector obtained by averaging the word vectors corresponding to the training image, which automatically suppresses the interference of noisy labels and enhances the text semantic information, so the effect of weakly supervised learning based on the text vector can be effectively improved, and the training effect of the quantized image retrieval model is further improved; secondly, through the end-to-end product quantization based on the attention mechanism, the training process of deep quantization coding can be carried out end to end by the quantized image retrieval model, and the precision of the image retrieval technology can be improved; finally, the scheme directly matches the quantized picture representation vector with the corresponding text representation vector through a contrastive learning loss function, and can obtain quantized vectors with stronger semantic representation capability.
To further illustrate the effect of the quantized image retrieval model determined by the generation method provided by this embodiment, public tests were carried out on the MIR-FLICKR25K and NUS-WIDE data sets, comparing the MAP indicators at encoding lengths of 8 bits, 16 bits, 24 bits and 32 bits against mainstream methods in the industry; the results are shown in the following table.

(Table of MAP comparison results: rendered as an image in the source and not reproduced here.)
Based on the above method for generating a quantized image retrieval model, this embodiment further provides an image retrieval method, which applies the quantized image retrieval model determined by the above method, where the image retrieval method includes:
inputting a query image into the quantized image retrieval model, and determining a query vector corresponding to the query image through the quantized image retrieval model;
determining similarity between the query vector and each code word in each of a plurality of codebooks;
and retrieving a target image corresponding to the query image in a preset image database based on the determined similarity.
Specifically, the query vector is determined for the query image by the quantized image retrieval model; it can be understood that the query vector is the quantized vector of the query image determined based on the quantized image retrieval model. The process of determining the query vector corresponding to the query image through the quantized image retrieval model may be: inputting the query image into the quantized image retrieval model, and determining the feature vector corresponding to the query image through the quantized image retrieval model; dividing the feature vector into a plurality of feature vector segments based on the plurality of codebooks; selecting, for each feature vector segment, a candidate code word from the codebook corresponding to that segment; and finally connecting the candidate code words corresponding to the feature vector segments to obtain the query vector corresponding to the query image.
In an implementation manner of this embodiment, for each feature vector segment in the plurality of feature vector segments, the cosine similarity between the feature vector segment and each code word in the codebook corresponding to the segment is determined; the code word with the largest cosine similarity is then selected from the plurality of code words and used as the candidate code word corresponding to the feature vector segment. The calculation formula of the cosine similarity between the feature vector segment and each code word in the corresponding codebook may be:

$$ \cos\left( v_m, c_i^m \right) = \frac{\left( c_i^m \right)^T v_m}{\left\lVert v_m \right\rVert \left\lVert c_i^m \right\rVert} $$

where $v_m$ is the feature vector segment, $c_i^m$ is the i-th code word in the m-th codebook $C_m$, and $\left( c_i^m \right)^T$ is the transposed vector of $c_i^m$.
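The hard code word selection for one segment can be sketched as follows; this is a minimal NumPy sketch with illustrative sizes and random data standing in for a trained codebook.

```python
import numpy as np

d, K = 8, 16                        # segment dimension and codebook size (illustrative)
rng = np.random.default_rng(2)
v_m = rng.standard_normal(d)        # one feature vector segment of the query
C_m = rng.standard_normal((K, d))   # the codebook matched with that segment

# Cosine similarity between the segment and every code word in the codebook.
cos = (C_m @ v_m) / (np.linalg.norm(C_m, axis=1) * np.linalg.norm(v_m))

# Candidate code word: the one with the largest cosine similarity; its
# index k is the code word identifier that can be stored for this segment.
k = int(np.argmax(cos))
candidate = C_m[k]
```

Storing only the integer `k` per segment, rather than the segment itself, is what compresses each image down to a short code.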
In addition, in practical application, each preset code word in the preset codebook is configured with a code word identifier; after the candidate code words corresponding to the vector segments are obtained, the code word identifiers can be stored, and the feature vector of the training image is thereby converted into a quantization vector represented by a plurality of code word identifiers. For example, the preset codebooks are $C_1, C_2, \ldots, C_N$, where $N$ is the number of preset codebooks, and each preset codebook includes a plurality of preset code words $c_1^m, c_2^m, \ldots, c_K^m$, where $m$ denotes the m-th preset codebook and $K$ is the number of preset code words; if the candidate code word corresponding to vector segment m is $c_k^m$, then the code word identifier of vector segment m may be $k$.
Based on this, in an implementation manner of this embodiment, before the query image is input to the quantized image retrieval model, and a query vector corresponding to the query image is determined by the quantized image retrieval model, the method further includes:
and respectively inputting each image in a preset image database into the quantized image retrieval model, and determining a quantized vector corresponding to each image through the quantized image retrieval model.
Specifically, the process of determining the quantization vector may be the same as the process of determining the query vector, and is not described herein again. In addition, after the quantization vector corresponding to each image is determined, each code word in the quantization vector can be represented by its code word identifier, so that a quantization vector represented by a plurality of code word identifiers is obtained. Each image in the preset image database can thus be converted into a quantization vector represented by a plurality of code word identifiers, the image database can be converted into a plurality of quantization vectors and a plurality of codebooks, and the storage space required by the image database can be saved.
In one implementation manner of this embodiment, the similarity is the similarity between the query vector and each of the code words in each of the codebooks; that is, each code word in each codebook corresponds to a similarity, forming a similarity list, so that the process of searching for the query image in the image database can be converted into the process of summing look-ups in the query similarity lists, and the image retrieval speed can be improved. The similarity between the query vector and a code word may be:

$$ S_{q,(m,i)} = \left( r_q^m \right)^T c_i^m $$

where $S_{q,(m,i)}$ represents the similarity between the query vector and the code word, $r_q^m$ is the segment of the query vector $r_q$ corresponding to the m-th codebook, $\left( r_q^m \right)^T$ is its transposed vector, and $c_i^m$ is the i-th code word in the codebook $C_m$. Correspondingly, the similarity sequence between the query vector and the m-th codebook may be:

$$ S_{q,m} = \left( S_{q,(m,1)}, S_{q,(m,2)}, \ldots, S_{q,(m,K)} \right) $$

where $S_{q,m}$ is the similarity sequence between the query vector and the m-th codebook $C_m$.
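Precomputing one similarity list per codebook can be sketched as follows; this is a minimal NumPy sketch in which, as assumed in the formula above, the m-th segment of the query vector is scored against codebook m, and all sizes and data are illustrative.

```python
import numpy as np

N, K, d = 8, 16, 8                  # codebooks, code words per codebook, segment dim
rng = np.random.default_rng(3)
codebooks = rng.standard_normal((N, K, d))
r_q = rng.standard_normal(N * d)    # query vector, one segment per codebook

# One similarity list per codebook: S[m, i] is the inner product of the
# m-th query segment with the i-th code word of codebook m.
q_segments = r_q.reshape(N, d)
S = np.einsum('md,mkd->mk', q_segments, codebooks)   # shape (N, K)
```

`S` is computed once per query; every database image is then scored with table look-ups alone, with no further vector arithmetic.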
In an implementation manner of this embodiment, the retrieving, based on the determined similarity, the target image corresponding to the query image in a preset database specifically includes:
determining candidate similarity of quantization vectors corresponding to the query image and each image in a preset database based on the determined similarity;
searching a target image corresponding to the query image in a preset database based on the determined candidate similarity;
if the target image is found, judging that the preset database contains the query image;
and if the target image is not found, judging that the preset database does not contain the query image.
Specifically, the calculation formula of the candidate similarity may be:

$$ S_q = \sum_{m=1}^{N} S_{q,(m,\,b_m)} $$

where $b_m$ is the code word identifier, with respect to the m-th codebook $C_m$, of the quantization vector segment of an image in the preset image database, $S_{q,m}$ is the similarity sequence between the query vector $r_q$ and the m-th codebook, and $S_{q,(m,\,b_m)}$ is the entry of $S_{q,m}$ indexed by $b_m$.
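Scoring one database image then reduces to summing N table look-ups, which can be sketched as follows; the similarity lists and code word identifiers here are random illustrative data.

```python
import numpy as np

N, K = 8, 16                        # number of codebooks and code words (illustrative)
rng = np.random.default_rng(4)
S = rng.standard_normal((N, K))     # per-codebook similarity lists for one query

# A database image is stored as N code word identifiers, one per codebook.
b = rng.integers(0, K, size=N)

# Candidate similarity: the sum of N table look-ups replaces a full
# inner product between the query and the image representation.
candidate_similarity = S[np.arange(N), b].sum()
```

Because each image costs only N look-ups and additions, the whole database can be scored far faster than by decompressing the quantization vectors.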
Further, after the candidate similarity between each query image and each image is obtained, whether the candidate similarity larger than a preset threshold exists or not can be searched in the candidate similarity, if the candidate similarity larger than the preset threshold exists, the image corresponding to the candidate similarity larger than the preset threshold is used as a target image corresponding to the query image, and the fact that the query image is contained in the preset database is judged; if the candidate similarity larger than the preset threshold does not exist, judging that the target image is not found, and correspondingly judging that the preset database does not contain the query image.
Based on the above-described method for generating a quantized image retrieval model, the present embodiment provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors, to implement the steps in the method for generating a quantized image retrieval model according to the above-described embodiment.
Based on the above generation method of the quantized image retrieval model, the present application further provides a terminal device, as shown in fig. 3, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory)22, and may further include a communication Interface (Communications Interface)23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional application and data processing, i.e. implements the method in the above-described embodiments, by executing the software program, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high speed random access memory and may also include a non-volatile memory. For example, a variety of media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, may also be transient storage media.
In addition, the specific processes loaded and executed by the storage medium and the instruction processors in the mobile terminal are described in detail in the method, and are not stated herein.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A method for generating a quantized image search model, the method comprising:
determining a predictive quantization vector corresponding to a training image in a preset sample set by using a preset network model;
determining a text vector corresponding to the training image based on the text label of the training image;
and training the preset network model based on the text vector and the prediction quantization vector to obtain a quantization image retrieval model.
2. The method for generating a quantitative image retrieval model according to claim 1, wherein the preset sample set includes a plurality of training image groups, and each training image group in the plurality of training image groups includes a training image and a text label corresponding to the training image.
3. The method for generating a quantitative image retrieval model according to claim 1, wherein the preset network model comprises a feature extraction module and an attention module; the determining, by using the preset network model, the predictive quantization vector corresponding to the training image in the preset sample set specifically includes:
inputting the training images in the preset sample set into the feature extraction module, and determining feature vectors corresponding to the training images through the feature extraction module;
and inputting the feature vector into the attention module, and determining a predictive quantization vector corresponding to the training image through the attention module.
4. The method for generating a quantized image retrieval model according to claim 3, wherein the preset network model is configured with a plurality of preset codebooks; inputting the feature vector into the attention module, and determining the predictive quantization vector corresponding to the training image by the attention module specifically includes:
dividing the characteristic vector into a plurality of vector sections, wherein the vector sections correspond to a plurality of preset codebooks one by one;
determining quantization vector sections corresponding to the vector sections based on the preset codebooks corresponding to the vector sections;
and determining the predictive quantization vector corresponding to the training image based on the quantization vector segment corresponding to each vector segment.
5. The method for generating a quantized image retrieval model according to claim 4, wherein said determining the quantized vector segments corresponding to the vector segments based on the preset codebooks corresponding to the vector segments specifically comprises:
for each vector segment in the plurality of vector segments, respectively determining each preset code word in a preset codebook corresponding to the vector segment and the attention weight of the vector segment;
and determining the quantized vector segment corresponding to the vector segment based on each preset code word and the attention weight corresponding to each preset code word so as to obtain the quantized vector segment corresponding to each vector segment.
6. The method of claim 5, wherein the determining, for each of the plurality of vector segments, the attention weight of each predetermined codeword in the predetermined codebook and the vector segment corresponding to the vector segment specifically comprises:
for each vector segment in the plurality of vector segments, respectively calculating each preset code word in a preset codebook and a first attention weight of the vector segment, and calculating the sum of all the first attention weights;
and for each preset code word in the preset codebook, calculating the ratio of the first attention weight corresponding to the preset code word to the sum value, and taking the ratio as the attention weight corresponding to the preset code word.
7. The method for generating a quantitative image retrieval model according to claim 1, wherein the text label comprises a plurality of text labels; the determining the text vector corresponding to the text label of the training image specifically includes:
inputting each text label in the plurality of text labels into a word embedding model, and determining the candidate text vector corresponding to each text label through the word embedding model;
and determining a text vector corresponding to the training image based on the candidate text vector corresponding to each text label.
8. The method for generating a quantized image retrieval model according to claim 7, wherein the vector dimensions of the candidate text vectors corresponding to the text labels are the same; the determining, based on the candidate text vectors corresponding to the text labels, the text vector corresponding to the training image specifically includes:
and calculating the average text vector of the candidate text vectors corresponding to the text labels, and taking the average text vector as the text vector corresponding to the training image.
9. The method for generating a quantized image retrieval model according to any of claims 1 to 8, wherein the training the preset network model based on the text vector and the predicted quantized vector to obtain the quantized image retrieval model specifically comprises:
determining a loss function value corresponding to the training image according to the text vector and the prediction quantization vector;
and training the model parameters of the preset network model and a plurality of preset codebooks configured by the model parameters based on the loss function values to obtain a quantized image retrieval model and a plurality of codebooks.
10. An image retrieval method to which a quantized image retrieval model determined by the method for generating a quantized image retrieval model according to any one of claims 1 to 9 is applied, the image retrieval method comprising:
inputting a query image into the quantized image retrieval model, and determining a query vector corresponding to the query image through the quantized image retrieval model;
determining similarity between the query vector and each code word in each of a plurality of codebooks;
and retrieving a target image corresponding to the query image in a preset image database based on the determined similarity.
11. The image retrieval method of claim 10, wherein before inputting the query image into the quantized image retrieval model, determining a query vector corresponding to the query image by the quantized image retrieval model, the method further comprises:
and respectively inputting each image in a preset image database into the quantized image retrieval model, and determining a quantized vector corresponding to each image through the quantized image retrieval model.
12. The image retrieval method of claim 11, wherein retrieving, based on the determined similarity, the target image corresponding to the query image in the preset image database specifically comprises:
determining, based on the determined similarity, a candidate similarity between the query image and the quantized vector corresponding to each image in the preset image database;
searching for the target image corresponding to the query image in the preset image database based on the determined candidate similarities;
if the target image is found, determining that the preset image database contains the query image;
and if the target image is not found, determining that the preset image database does not contain the query image.
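The found/not-found branch in claim 12 can be read as a threshold test on the best candidate similarity. The threshold rule below is an assumption for illustration; the claim does not fix how "found" is decided.

```python
def retrieve(candidate_similarities, threshold):
    """Return the index of the target image, or None when the database is
    judged not to contain the query image (assumed threshold rule)."""
    best = max(range(len(candidate_similarities)),
               key=lambda i: candidate_similarities[i])
    if candidate_similarities[best] >= threshold:
        return best          # target image found in the database
    return None              # database judged not to contain the query image

# the best match (0.9) clears the threshold, so image 1 is the target
hit = retrieve([0.2, 0.9, 0.4], threshold=0.5)
# no candidate clears the threshold, so the query image is absent
miss = retrieve([0.2, 0.3], threshold=0.5)
```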
13. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the method for generating a quantized image retrieval model according to any one of claims 1 to 9, and/or the steps in the image retrieval method according to any one of claims 10 to 12.
14. A terminal device, comprising: a processor, a memory, and a communication bus, wherein the memory stores a computer-readable program executable by the processor;
the communication bus enables connection and communication between the processor and the memory;
and the processor, when executing the computer-readable program, implements the steps in the method for generating a quantized image retrieval model according to any one of claims 1 to 9, and/or the steps in the image retrieval method according to any one of claims 10 to 12.
CN202110432335.0A 2021-04-21 2021-04-21 Quantized image retrieval model generation method, retrieval method, medium and terminal Active CN113127672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110432335.0A CN113127672B (en) 2021-04-21 2021-04-21 Quantized image retrieval model generation method, retrieval method, medium and terminal


Publications (2)

Publication Number Publication Date
CN113127672A true CN113127672A (en) 2021-07-16
CN113127672B CN113127672B (en) 2024-06-25

Family

ID=76778824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110432335.0A Active CN113127672B (en) 2021-04-21 2021-04-21 Quantized image retrieval model generation method, retrieval method, medium and terminal

Country Status (1)

Country Link
CN (1) CN113127672B (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180285386A1 (en) * 2017-03-31 2018-10-04 Alibaba Group Holding Limited Method, apparatus, and electronic devices for searching images
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 Image-text associated retrieval method based on two-channel network
US20190108411A1 (en) * 2017-10-11 2019-04-11 Alibaba Group Holding Limited Image processing method and processing device
CN110516677A (en) * 2019-08-23 2019-11-29 上海云绅智能科技有限公司 A kind of neural network recognization model, target identification method and system
CN110674328A (en) * 2019-09-27 2020-01-10 长城计算机软件与***有限公司 Trademark image retrieval method, system, medium and equipment
CN110851641A (en) * 2018-08-01 2020-02-28 杭州海康威视数字技术股份有限公司 Cross-modal retrieval method and device and readable storage medium
CN110866140A (en) * 2019-11-26 2020-03-06 腾讯科技(深圳)有限公司 Image feature extraction model training method, image searching method and computer equipment
CN111930984A (en) * 2019-04-24 2020-11-13 北京京东振世信息技术有限公司 Image retrieval method, device, server, client and medium
CN112232425A (en) * 2020-10-21 2021-01-15 腾讯科技(深圳)有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN112418298A (en) * 2020-11-19 2021-02-26 北京云从科技有限公司 Data retrieval method, device and computer readable storage medium
US20210089571A1 (en) * 2017-04-10 2021-03-25 Hewlett-Packard Development Company, L.P. Machine learning image search


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023020214A1 (en) * 2021-08-17 2023-02-23 腾讯科技(深圳)有限公司 Retrieval model training method and apparatus, retrieval method and apparatus, device and medium
CN114329006A (en) * 2021-09-24 2022-04-12 腾讯科技(深圳)有限公司 Image retrieval method, device, equipment and computer readable storage medium
CN115082837A (en) * 2022-07-27 2022-09-20 新沂市新南环保产业技术研究院有限公司 Flow rate control system for filling PET bottle with purified water and control method thereof
CN115082837B (en) * 2022-07-27 2023-07-04 新沂市新南环保产业技术研究院有限公司 Flow rate control system for filling purified water into PET bottle and control method thereof

Also Published As

Publication number Publication date
CN113127672B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN112329465B (en) Named entity recognition method, named entity recognition device and computer readable storage medium
CN113127672B (en) Quantized image retrieval model generation method, retrieval method, medium and terminal
US20190108242A1 (en) Search method and processing device
US11232147B2 (en) Generating contextual tags for digital content
CN112069319B (en) Text extraction method, text extraction device, computer equipment and readable storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN114861889B (en) Deep learning model training method, target object detection method and device
CN110263218B (en) Video description text generation method, device, equipment and medium
CN107330009B (en) Method and apparatus for creating topic word classification model, and storage medium
CN113657087B (en) Information matching method and device
US12026192B2 (en) Image retrieval method, image retrieval devices, image retrieval system and image display system
CN111950279A (en) Entity relationship processing method, device, equipment and computer readable storage medium
WO2021012691A1 (en) Method and device for image retrieval
CN114492669B (en) Keyword recommendation model training method, recommendation device, equipment and medium
CN115062134A (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN110019952B (en) Video description method, system and device
CN114428842A (en) Method and device for expanding question-answer library, electronic equipment and readable storage medium
CN116030375A (en) Video feature extraction and model training method, device, equipment and storage medium
CN113239215B (en) Classification method and device for multimedia resources, electronic equipment and storage medium
US20230306087A1 (en) Method and system of retrieving multimodal assets
CN117009534B (en) Text classification method, apparatus, computer device and storage medium
US20230169110A1 (en) Method and system of content retrieval for visual data
CN111797257B (en) Picture recommendation method and related equipment based on word vector
CN112732913B (en) Method, device, equipment and storage medium for classifying unbalanced samples

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant