CN111753116A - Image retrieval method, device, equipment and readable storage medium

Image retrieval method, device, equipment and readable storage medium

Info

Publication number
CN111753116A
Authority
CN
China
Prior art keywords
image
query
embedding
query set
attention
Prior art date
Legal status
Granted
Application number
CN201910452983.5A
Other languages
Chinese (zh)
Other versions
CN111753116B (en)
Inventor
潘滢炜
姚霆
梅涛
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Publication of CN111753116A publication Critical patent/CN111753116A/en
Application granted granted Critical
Publication of CN111753116B publication Critical patent/CN111753116B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5866 Retrieval characterised by using metadata using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image retrieval method, an image retrieval device, image retrieval equipment and a readable storage medium. The method comprises the following steps: performing semantic embedding based on acquired click data from a user's image retrieval; performing visual embedding based on an attention mechanism; projecting the query set and the corresponding images into a low-dimensional embedding space through the click-based semantic embedding and the attention-integrated visual embedding, and performing target training; and performing image retrieval based on keywords.

Description

Image retrieval method, device, equipment and readable storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to an image retrieval method, an image retrieval device, image retrieval equipment and a readable storage medium.
Background
The amount of image data generated, distributed, and propagated has increased explosively, becoming an indispensable part of today's big data. This has led to rapid development of research on large-scale image retrieval. One basic research problem is keyword-based image retrieval, which attempts to retrieve the images most relevant to a keyword and rank them according to their relevance to the given retrieval text. Since text queries and visual images belong to two different modalities, the similarity between them cannot be evaluated directly. This problem is commonly referred to as the "semantic gap". Most commercial search engines circumvent it by using relevance models, such as the vector space model, BM25, and language models, to perform similarity measurements on the text associated with images. However, similarity measurements from text-based models may not always be accurate, especially when the text description cannot depict the important visual content, not to mention that some images are not associated with any text at all. Another solution to the "semantic gap" problem is to learn an image ranker based on query-image pairs, which are typically labeled by human experts. However, manual labeling is often too expensive to afford, so such methods are difficult to apply on a large scale. Even experts sometimes have difficulty in determining the user's search intent or judging the relevance between queries and images, resulting in noisy labels in the training data.
The above information disclosed in this background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present invention provides an image retrieval method, apparatus, device and readable storage medium, which can perform image retrieval based on visual attention and depth structure preservation.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the present invention, there is provided an image retrieval method including: performing semantic embedding based on the acquired click data for image retrieval of the user; performing visual embedding based on an attention mechanism; projecting the query set and the corresponding images into a low-dimensional embedding space through the semantic embedding and the visual embedding, and performing target training; and performing image retrieval based on the keywords.
According to an embodiment of the present invention, performing semantic embedding based on the acquired click data for image retrieval by the user includes: constructing a bipartite graph based on the click data, the bipartite graph including a plurality of the images and at least one query for each of the images; merging the at least one query of each of the images into one of the query sets, respectively; and for each of the query sets, performing the following operations: merging the at least one query weighted by the number of clicks of the query in the query set and learning semantic embedding based on a cumulative representation form of the query set to generate click-based query set representations of the query set, respectively; and applying a single-layer neural network to generate a query set representation of the query set in the embedding space for the click-based query set representation of the query set.
According to an embodiment of the present invention, performing visual embedding based on an attention mechanism includes: for each of the images, performing the following operations: determining an overall image feature map of the image based on the output feature map of a convolutional layer of a deep convolutional neural network, the overall image feature map comprising local descriptors of a plurality of regions; incorporating M attention layers into the deep convolutional neural network, and performing the following operations: inputting the overall image feature map into each of the M attention layers to generate the attention distribution of the image for that attention layer; and, based on the attention distribution of each attention layer, performing a weighted combination of the local descriptors of the plurality of regions to generate M aggregated image representations of the image; obtaining an output image representation of the image by averaging the M aggregated image representations of the image; and applying a visual embedding layer of the deep convolutional neural network to embed the output image representation into the embedding space to obtain the image representation of the image in the embedding space.
According to an embodiment of the present invention, projecting a query set and the corresponding images into a low-dimensional embedding space through the semantic embedding and the visual embedding and performing target training includes: for each of the query sets, determining a loss function for target training, comprising: determining a cross-modal ranking constraint based on a margin ranking loss according to the query set, a first image clicked by a query in the query set, and a second image not clicked by any query in the query set; determining a neighborhood cross-modal ranking constraint based on a margin ranking loss according to the query set, a third image semantically similar to the first image, and the second image; determining a neighborhood structure preservation constraint based on structure preservation regularization according to the first image, the second image and the third image; and determining the loss function according to the cross-modal ranking constraint, the neighborhood cross-modal ranking constraint and the neighborhood structure preservation constraint.
According to an embodiment of the present invention, the projecting the query set and the corresponding image into the low-dimensional embedding space through the semantic embedding and the visual embedding, and the performing the target training further includes: target training the query set and the image based on the loss function to minimize an overall loss.
According to an embodiment of the present invention, performing image retrieval based on a keyword includes: given a text query formed from the keyword, performing image retrieval in the embedding space by ranking the images according to the inner products between the text query and the images in the embedding space.
According to an embodiment of the present invention, the embedding space is a low-dimensional embedding space for learning similarity between the query set and the image, and the similarity between the query set and the image is directly measured by an inner product of mapping in the embedding space.
According to another aspect of the present invention, there is provided an image retrieval apparatus including: the semantic embedding module is used for carrying out semantic embedding on the basis of the acquired click data for image retrieval of the user; the visual embedding module is used for carrying out visual embedding based on an attention mechanism; the target training module is used for projecting the query set and the corresponding images into a low-dimensional embedding space through the semantic embedding and the visual embedding so as to carry out target training; and the image retrieval module is used for carrying out image retrieval based on the keywords.
According to still another aspect of the present invention, there is provided a computer apparatus comprising: a memory, a processor and executable instructions stored in the memory and executable in the processor, the processor implementing any of the methods described above when executing the executable instructions.
According to yet another aspect of the present invention, there is provided a computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement any of the methods described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
Fig. 1 is an image retrieval example shown according to an example.
FIGS. 2 and 3 are overviews of the click-data-based deep structure preserving embedding with visual attention (DSPEA) model, shown in accordance with an exemplary embodiment.
FIG. 4 is a flow diagram illustrating an image retrieval method according to an exemplary embodiment.
FIG. 5 is an example of a click data set shown according to an example.
Fig. 6 shows the variation of the training loss and the validation loss as the number of training iterations increases.
Fig. 7 shows the results of a picture search by ten different methods according to an example.
Fig. 8 is a result of picture search by ten different methods shown according to another example.
FIG. 9 is an illustration of four image examples using a multi-head attention mechanism, according to an example.
Figure 10 is a graphical representation of NDCG performance curves for different embedding dimensions using different methods.
Fig. 11 illustrates the NDCG@25 performance improvement of different combinations of constraints over using only the cross-modal ranking (CR) constraint, across different dimensions of the embedding space.
Fig. 12 is a block diagram illustrating an image retrieval apparatus according to an exemplary embodiment.
FIG. 13 is a block diagram illustrating a computer system in accordance with an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Further, in the description of the present invention, "and/or" describes the association relationship of the associated objects and covers three cases; for example, "A and/or B" means A alone, B alone, or both A and B. The symbol "/" generally indicates that the former and latter associated objects are in an "or" relationship.
One fundamental problem in image retrieval is how to learn a ranking function, i.e., the similarity between a text query and an image. Recent work on this problem falls into two broad categories: text-based models and image-ranker-based models. The former relies on the text attached to the images, so the resulting similarity is sensitive to that text, and noisy surrounding text leads to low-quality similarity measurements. The latter faces robustness issues if the manually labeled query-image pairs do not accurately reflect users' query habits. In the present invention, we show how to learn a cross-modal feature space from user click data, thereby addressing both limitations. Specifically, a deep structure preserving embedding with visual attention (DSPEA) model built on user click data is proposed, which consists of two parts: the first part is an image feature branch built on a deep convolutional neural network, responsible for learning visual features; the second part is a text semantic branch built on a deep neural network, used to generate semantic features of text queries. Meanwhile, a visual attention mechanism is integrated into the convolutional neural network to reflect the image regions relevant to the text query. Furthermore, considering that the query feature space has very high dimensionality, a new query set representation based on user click data is proposed to alleviate this high-dimensional sparsity problem. The whole framework can be trained end-to-end by optimizing a large-margin objective function that combines cross-modal ranking constraints with an intra-modal structure-preserving neighborhood constraint. Compared with several state-of-the-art retrieval models, on a large-scale click dataset with 11.7 million queries and one million images, the proposed model performs better on the keyword-based image retrieval task, reaching 52.21% on NDCG@25, the highest value reported so far.
The image retrieval method provided by the invention addresses these two problems. First, the method studies cross-modal (text-image) embedding by learning a common embedding space that allows the retrieval text and the visual image to be compared directly. Thus, by mapping the visual image representation and the retrieval text features into the embedding space, the similarity between the retrieval text and the image can be compared directly. In addition, the dimension of the embedding space is greatly reduced, which greatly reduces memory consumption. Some attempts along this line have been made, such as Canonical Correlation Analysis (CCA), Partial Least Squares (PLS), and click-through-based cross-modal learning (CCL). Although the learning objectives of these methods differ, they all compress the entire image into a static representation without emphasizing the image regions most relevant to the query. This limitation matters most when the image contains much query-irrelevant content. Fig. 1 is an image retrieval example shown according to an example. Taking the query and image in Fig. 1 as an example, the region most relevant to the query "customize wedding cake decoration" is the decoration on the top of the cake in the image. Furthermore, the vocabulary of the query text space is typically very large, e.g., up to tens of thousands or even millions of words, which makes the feature of a single query very sparse.
The present invention is primarily directed to a new deep embedding model that is guided to the most relevant regions in the image by integrating a visual attention mechanism, and that represents the input as a query set rather than as individual queries. Second, click data is studied as an effective means for understanding the user's search intention in image search. Because most image search engines display search results as thumbnails, the user can browse the image search results before clicking on a particular image, and thus users mainly tend to click on images that are relevant to their query intent. Therefore, click data can serve as reliable, implicit feedback for image search.
By integrating the ideas of cross-modal embedding and click data, the invention provides a novel image retrieval method based on a click-data-based deep structure preserving embedding with visual attention (DSPEA) model. Specifically, a bipartite graph between queries and images is constructed from the image click data of a real image search engine, in which a link is established between a query and an image if the image was clicked by a user who issued the query. All queries linked by edges to an image are merged into a query set, and the representation of the query set is fed into a deep neural network to learn the semantic (query) embedding. Similarly, the image representation is extracted with a deep Convolutional Neural Network (CNN), and an embedding layer then generates the visual (image) embedding. To extract the image regions most relevant to the query, multiple independent attention layers are incorporated into the CNN, and a multi-head attention mechanism is employed to provide a comprehensive attention distribution over all regions. The goal of DSPEA is to learn query and image embeddings by ensuring that relevant images have higher similarity to the query set than irrelevant images, and by preserving the neighborhood structure within a single modality. After optimizing the embeddings, the similarity between a query and an image in the original space can be computed directly from the inner product of their mappings in the embedding space. Notably, the entire deep framework can be trained in an end-to-end manner.
The main contribution of the invention is the DSPEA framework, which learns structure-preserving embeddings from click data to measure the similarity between queries and images in image retrieval. The solution also provides insight into how attention should be incorporated into the similarity measurement between queries and images, and how the sparsity of query features can be alleviated, issues that have not been fully resolved in the existing literature.
The invention mainly concerns the similarity measurement problem in cross-modal embedding learning. We briefly divide the related work into two categories: traditional cross-modal methods and deep learning based cross-modal methods.
1. Traditional cross-modal methods:
traditional approaches aim at directly learning the similarity between target pairs, especially in a shared latent subspace. CCA and PLS are the two most common cross-modal learning methods, which utilize linear transformation matrices to maximize the correlation between the mappings, computed by cosine similarity and dot product, respectively. CCA was later extended to its kernel version, known as Kernel CCA (KCCA), by replacing the linear mapping with a non-linear one. For example, in the related art, there is a method of constructing a kernel embedding space between images and videos to solve the domain transfer problem. Other approaches further propose a tri-modal CCA that adds a third modality to CCA to explicitly emphasize the potential correlations between different modalities. Still other methods propose Polynomial Semantic Indexing (PSI), which learns two low-rank transformation matrices in a learning-to-rank scheme to measure query-text similarity. Another approach learns query-text similarity from the click-through bipartite graph using multi-view PLS (M-PLS, an extension of PLS) and multiple feature types. In a similar spirit, some methods treat image retrieval as a click-based cross-modal problem, minimizing the distance between queries and images in the subspace while preserving the inherent structure of the original space. Other approaches extend CCA to Ranking Canonical Correlation Analysis (RCCA), which simultaneously learns a bilinear query-image similarity function and adjusts the subspace to preserve the preference relations implied by the click data. More recently, a supervised cross-modal learning approach was proposed that utilizes class labels to learn consistent features for cross-modal matching. Later, there were also methods for cross-modal retrieval using a two-way ranking scheme that identifies semantically matched image-text pairs from unmatched ones while maximizing the consistency between modalities. In addition, Multi-view Discriminant Analysis (MvDA) is designed to learn a discriminative common space by optimizing the generalized Rayleigh quotient based on class information.
2. Deep learning based cross-modal embedding methods:
inspired by the recent advances of deep learning in a large number of research areas (e.g., recognition and detection), researchers have focused on designing deep architectures to bridge the semantic gap between different modalities for similarity learning. The deep visual-semantic embedding model (DeViSE) is one of the early works on building a visual-semantic embedding space with a deep structure. It constructs the loss function by combining dot-product similarity with a hinge ranking loss and trains the whole network. Later approaches extended CCA to an end-to-end deep learning scheme, Deep CCA (DCCA), to measure the similarity between image-caption pairs. Still other approaches learn an image-word embedding space from a large and diverse collection of social images and their tags. In addition, deep visual-semantic embedding learning is also used to solve vision-language problems (e.g., image captioning and visual question answering). More notably, some research efforts explore deep learning based cross-modal models together with click data. For example, a deep neural network for word-based image retrieval is proposed that first learns high-level image representations and then maps the images into a bag-of-words space. The similarity between the image and the query is then measured by the cosine similarity between the bag-of-words representation of the query and the projected bag-of-words representation of the image. Another deep model trained on click data is the click-based deep visual-semantic embedding (C-DVSE) model, which consists of two deep neural networks for learning visual embedding and semantic embedding, respectively. The whole framework is optimized with a correlation loss layer to measure the similarity between the two learned embeddings.
The work of the present invention belongs to the deep learning based methods. Unlike the deep models described above, this approach not only optimizes the common embedding space for similarity learning with cross-modal ranking constraints, but also preserves the neighborhood structure within a modality. In addition, an attention mechanism and a click-based query set representation are integrated into the overall architecture to extract the image regions most relevant to the query and to alleviate the sparsity of query features.
The basic idea of the click-based deep structure preserving embedding with visual attention (DSPEA) proposed by the present invention is to facilitate similarity learning between queries and images from click data by building a common embedding space in a deep architecture. Thus, text queries and visual images that would otherwise be incomparable can be compared directly in this common space. In particular, the DSPEA of the invention consists of two branches: a click-based query set representation for learning the semantic embedding, and a visual attention-based image representation for learning the visual embedding. In DSPEA, the two branches are trained simultaneously by enforcing the cross-modal ranking order implicitly conveyed in the click data and by preserving the neighborhood structure within a single modality. Accordingly, the objective function of DSPEA includes two parts, namely cross-modal ranking constraints between query sets and images and a neighborhood structure preservation constraint among images. A method overview is shown in Figs. 2 and 3, which are overviews of the click-data-based deep structure preserving embedding with visual attention (DSPEA) model shown according to an exemplary embodiment. Fig. 2(a) is the bipartite graph built from the image retrieval log, and Fig. 2(b) shows the query sets obtained by merging the queries whose edges connect to each image in the bipartite graph, together with the image graph constructed by computing the semantic similarities between query sets. Fig. 3 shows the convolutional-layer local descriptors weighted and linearly fused according to their attention probabilities. The cross-modal ranking constraints and the intra-modal neighborhood structure constraint are minimized while the query and image embeddings are optimized.
FIG. 4 is a flow diagram illustrating an image retrieval method according to an exemplary embodiment.
Before describing the method 10, the following notations are explained.
Assume that there is a click bipartite graph G = {W, E}, where W = Q ∪ V represents the set of vertices, including the query set Q and the image set V, and E is the set of edges connecting query vertices and image vertices. The weight of each edge represents the total number of times the image was clicked after the corresponding query was issued. Suppose that a total of n triads {⟨q_i, v_i, c_i⟩} are generated from the click bipartite graph, where each triad indicates that image v_i was clicked c_i times in response to query q_i. Note that each query q_i is described by its term frequency (TF) representation q_i, i.e., a bag-of-words representation weighted by query frequency.
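For illustration, the following sketch (in Python, with hypothetical helper names) shows how such click triads could be grouped into the per-image query sets used below; it is a minimal sketch, not the patent's implementation.

```python
from collections import defaultdict

def build_query_sets(click_triads):
    """Group click triads (query, image, clicks) into per-image query sets.

    Each image is associated with the queries that clicked it, together with
    the corresponding click counts, mirroring the bipartite graph described
    above. `click_triads` is an iterable of (query_id, image_id, clicks).
    """
    query_sets = defaultdict(dict)   # image_id -> {query_id: clicks}
    for query_id, image_id, clicks in click_triads:
        # Sum clicks in case the same (query, image) pair appears twice.
        query_sets[image_id][query_id] = query_sets[image_id].get(query_id, 0) + clicks
    return query_sets

# Toy usage: two queries clicking the same image form one query set.
triads = [("red car", "img_1", 5), ("sports car", "img_1", 2), ("cat", "img_2", 7)]
print(build_query_sets(triads))
# {'img_1': {'red car': 5, 'sports car': 2}, 'img_2': {'cat': 7}}
```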
Referring to fig. 4, the method 10 includes:
in step S102, semantic embedding is performed based on the click.
And acquiring click data of the user, and performing semantic embedding based on the click data.
After all queries linked to one image in the bipartite graph are combined into one query set, each query set establishes a unique link with each image, so that the query space and the image space are naturally in one-to-one correspondence. However, since the representations of the query and the image are absolutely heterogeneous, the similarity between them cannot be directly calculated. One solution we pursued in the work of the present invention is to rely on cross-modal embedding learning, which assumes the existence of a low-dimensional embedding space for the query set and the image representation.
We first introduce how to learn the semantic embedding with a deep architecture, so as to map query representations into the embedding space described above. Specifically, given the TF representation q_i of each query in the query space, previous click-based cross-modal models directly transform the highly sparse query representation into a low-dimensional embedding space, which often makes the optimization difficult to converge. Here, we merge all queries in each query set weighted by their corresponding numbers of clicks and learn the semantic embedding from the cumulative representation of the query set, aiming to mitigate the high sparsity of query features. Technically, since one image may correspond to multiple issued queries and one issued query may correspond to multiple images, we assume that there are m unique images in the bipartite graph G and let s_i denote the set of queries that have an edge linked to image v_i. The representation of the query set s_i can then be computed as:

\[ q_{s_i} = \sum_{q_j \in s_i} c_j \, q_j \quad (1) \]

where q_{s_i} ∈ R^{d_q} is the query set representation, d_q is the feature dimension, and c_j is the number of clicks associated with query q_j. Then, a single-layer neural network is applied to the click-based query set representation q_{s_i} to generate the query semantic embedding:

\[ q_{e_i} = f_q(q_{s_i}) \quad (2) \]

where q_{e_i} ∈ R^{d_e} is the representation of the query set in the embedding space, d_e is the dimension of the embedding space, and f_q(·) is the mapping function of the click-based semantic embedding layer. The neural network encodes the syntax and semantics of the query set.
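A minimal sketch of equations (1) and (2) is given below; the tanh activation of the single-layer network is an assumption (the description only states a single-layer neural network), and the vocabulary size 50,000 and embedding dimension 80 are taken from the experimental settings described later.

```python
import numpy as np

def query_set_representation(tf_vectors, clicks):
    """Equation (1): sum of TF query vectors weighted by their click counts."""
    return sum(c * q for q, c in zip(tf_vectors, clicks))

def semantic_embedding(q_s, W_q, b_q):
    """Equation (2): single-layer mapping f_q into the d_e-dimensional space.
    The tanh activation is an assumption; the patent only states a
    single-layer neural network."""
    return np.tanh(W_q @ q_s + b_q)

d_q, d_e = 50000, 80                       # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_q, b_q = rng.normal(size=(d_e, d_q)) * 0.01, np.zeros(d_e)

# Two queries in the same query set with 5 and 2 clicks respectively.
q1, q2 = np.zeros(d_q), np.zeros(d_q)
q1[[10, 42]] = 1.0                          # toy TF entries
q2[[42, 97]] = 1.0
q_s = query_set_representation([q1, q2], [5, 2])
print(semantic_embedding(q_s, W_q, b_q).shape)   # (80,)
```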
In step S104, visual embedding is performed based on an attention mechanism.
To learn the visual embedding from a raw image, a deep Convolutional Neural Network (CNN), an architecture widely used for image classification, is employed to learn the image representation. However, unlike prior work that uses such deep architectures and takes the output of a fully connected layer as the image representation, we select the output feature map of a convolutional layer to represent the original image, since the convolutional layer contains more spatial information. Specifically, the feature map of the last convolutional layer (conv5) of AlexNet has dimension K × K × D, where K × K is the number of regions in the feature map and D is the feature vector dimension of each region (K = 13 and D = 256 in the experiments of the invention). Each region is described by a local descriptor x_j ∈ R^D, j ∈ [1, K²], where j is the index of the region. Thus, for image v_i, the global image feature map composed of the K² D-dimensional local descriptors is expressed as:

\[ X_i = [\, x_1, x_2, \dots, x_{K^2} \,] \in \mathbb{R}^{K^2 \times D} \quad (3) \]

Each local descriptor corresponds to a different, overlapping region of the original image. We refer to these local descriptors as feature cubes, as shown in Figs. 2 and 3.
In many cases, the semantics of the query are only relevant to certain regions of the corresponding image. Therefore, similarity learning between a query and an image using one global feature vector over the entire image may lead to sub-optimal results due to noise from regions unrelated to the query. Inspired by the attention mechanism, which has become almost a de facto standard in sequence learning tasks, we apply an attention mechanism to the extracted image feature map to focus only on the relevant regions, thereby enhancing the image representation learning for the similarity measurement. This design of the attention mechanism makes it possible to accurately determine regions that are highly relevant to a query and to further incorporate the contributions of different regions into the generated image representation. Specifically, given the image feature matrix X ∈ R^{K²×D} of an image, we first feed it into a single-layer neural network and then generate the attention distribution over all regions of the image with the softmax function:

\[ a = \tanh(W_a X^{\top} + b_a) \quad (4) \]
\[ p = \mathrm{softmax}(W_p \, a + b_p) \quad (5) \]

where W_a ∈ R^{d_a×D} and W_p ∈ R^{1×d_a} are parameter matrices, b_a and b_p are bias terms, and tanh(·) is a standard non-linear function. It should be noted that d_a denotes the size of the hidden layer of the attention layer. Therefore, p is a K²-dimensional vector corresponding to the attention probabilities of the image regions, and its j-th element p_j is the attention probability of the image region indexed by j. Based on the attention distribution, we compute the weighted sum of the local descriptors of all regions and obtain the attention-weighted aggregated image representation:

\[ \tilde{x} = \sum_{j=1}^{K^2} p_j \, x_j \quad (6) \]
since the regions of the image that are potentially most relevant for similarity learning between the query and the image are extracted and weighted with higher attention, the aggregate image representation can be treated as a more informative image representation. However, given the complexity of particularly long queries, it is not always sufficient to locate the correct region with a single attention mechanism. We have derived a heuristic from the successful application of multi-head attention in machine translation and graphical structure modeling, extending the attention mechanism to multi-head attention to facilitate and stabilize attention learning. Specifically, we incorporate M independent attention layers into convolutional neural networks, each of which applies an attention mechanism to perform the feature transformation of equation (6). Final output image representation
Figure BDA0002075724040000122
Is generated by average pooling of all aggregate image representations in the M attention layers:
Figure BDA0002075724040000123
in the formula
Figure BDA0002075724040000124
Representing the normalized attention distribution measured at the end of the mth attention layer,
Figure BDA0002075724040000125
the average attention distribution calculated by a multi-head attention mechanism mode is shown.
Finally, a visual embedding layer is applied to map the final output image representation \hat{x} into the embedding space:

\[ v_{e_i} = f_v(\hat{x}) \quad (8) \]

where v_{e_i} ∈ R^{d_e} is the image representation in the embedding space and f_v(·) is the mapping function of the visual embedding layer; in the following, we also write f_v(v) for the resulting embedding of image v.
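The attention-based aggregation of equations (4) to (8) can be sketched as follows; the layer shapes follow the description above (K = 13, D = 256, d_a = 256, M = 3, d_e = 80), but the concrete wiring and the tanh activation of the visual embedding layer are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_head(X, W_a, b_a, w_p, b_p):
    """Equations (4)-(6): attention probabilities over the K*K regions of the
    conv feature map X (shape K^2 x D), then the weighted sum of descriptors."""
    a = np.tanh(X @ W_a.T + b_a)            # (K^2, d_a) hidden attention features
    p = softmax(a @ w_p + b_p)              # (K^2,) attention distribution
    return p @ X, p                         # aggregated representation and weights

def multi_head_image_embedding(X, heads, W_v, b_v):
    """Equations (7)-(8): average the M aggregated representations, then map
    them into the embedding space with the visual embedding layer f_v."""
    agg = np.mean([attention_head(X, *h)[0] for h in heads], axis=0)
    return np.tanh(W_v @ agg + b_v)         # activation assumed

K, D, d_a, d_e, M = 13, 256, 256, 80, 3
rng = np.random.default_rng(1)
X = rng.normal(size=(K * K, D))             # conv5 feature cubes of one image
heads = [(rng.normal(size=(d_a, D)) * 0.01, np.zeros(d_a),
          rng.normal(size=d_a) * 0.01, 0.0) for _ in range(M)]
W_v, b_v = rng.normal(size=(d_e, D)) * 0.01, np.zeros(d_e)
print(multi_head_image_embedding(X, heads, W_v, b_v).shape)   # (80,)
```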
In step S106, the query set and the corresponding images are projected into the low-dimensional embedding space through the above-mentioned click-based semantic embedding and attention-integrated visual embedding, and target training is performed.
Through the click-based semantic embedding and attention-integrated visual embedding described above, a query set and the corresponding images are projected into a low-dimensional embedding space. The similarity between a query and an image can then be measured directly by the inner product of their mappings in the embedding space, which is equivalent to cosine similarity because an L2-normalization layer is attached on top of each embedding layer. Next, to learn the overall architecture of DSPEA, we design a joint training objective, including two cross-modal ranking constraints and a neighborhood structure preservation constraint, to ensure that relevant images achieve higher similarity to a query set than irrelevant images while the neighborhood structure within the image modality is preserved.
Cross-modal ranking constraint
Relative similarity relationships such as "image v+ should be more relevant to the query set s than image v-" are naturally conveyed in the bipartite graph; that is, image v+ has been clicked by some query in the query set s, while image v- has not been clicked by any query in the query set s. Exploiting these relative similarity relationships has proven to be quite effective for learning retrieval functions. Inspired by this idea, a cross-modal ranking constraint (CR) is included in the training objective so as to preserve the relative similarity relations between query sets and images in cross-modal learning. Specifically, we can easily obtain a set of triads {(s, v+, v-)} from the click data, where each triad consists of a query set s, an image v+ that was clicked by some query in the query set s, and an image v- that was not clicked by any query in the query set s. To preserve the relative relationships in these triads, we aim to learn, via the semantic embedding and the visual embedding, a mapping of the query set f_q(q_s) in the embedding space that is more similar to the image mapping f_v(v+) than to the image mapping f_v(v-). Therefore, the margin ranking loss, which has been widely applied in information retrieval and computer vision, is adopted as the cross-modal ranking constraint:

\[ \mathcal{L}_{CR}(s, v^{+}, v^{-}) = \max\bigl(0,\; margin - f_q(q_s)\cdot f_v(v^{+}) + f_q(q_s)\cdot f_v(v^{-})\bigr) \quad (9) \]

where margin is a constant parameter that controls the minimum gap between the two pairwise similarities in the margin ranking loss.

Furthermore, in addition to the relative relationship among the query set, the clicked image, and the images not clicked by any query in the query set, we explore another similarity relationship by considering neighborhood images that are semantically similar to a clicked image. Let N(v+) = {v_k} denote the set of images semantically similar to image v+, i.e., each image v_k was clicked in queries semantically similar to those of v+. Note that an image pair is regarded as a semantically similar pair if the cosine similarity between their corresponding query set representations is greater than 0.8. Thus, given a triad (s, v_k, v-) consisting of a query set s, a neighborhood image v_k, and an irrelevant image v- not clicked by any query in s, we design a neighborhood cross-modal ranking constraint (NCR) on this triad to additionally measure the relative similarity relationship:

\[ \mathcal{L}_{NCR}(s, v_k, v^{-}) = \max\bigl(0,\; margin - f_q(q_s)\cdot f_v(v_k) + f_q(q_s)\cdot f_v(v^{-})\bigr) \quad (10) \]

Minimizing the neighborhood cross-modal ranking constraint on the triads (s, v_k, v-) preserves the relative similarity relationship between the query set and the image mappings in the latent space, so that the query set mapping f_q(q_s) is more similar to the neighborhood image mapping f_v(v_k) than to the irrelevant image mapping f_v(v-).
Neighborhood structure preservation constraints
Structure preservation, or manifold regularization, has proven effective in semi-supervised learning [21] and cross-modal learning [5]. This type of regularization requires that points that are similar in the original space should be mapped to nearby locations in the latent space. Here we exploit semantic information to estimate the underlying structure in the image view and develop a structure-preserving regularization under the assumption that semantically similar images should have neighboring mappings in the embedding space. Specifically, given a triad (v+, v_k, v-) consisting of a clicked image v+, a related neighborhood image v_k, and an image v- irrelevant to the corresponding query set s, the structure-preserving regularization, i.e., the intra-modal neighborhood structure preservation constraint (NSP), is defined as:

\[ \mathcal{L}_{NSP}(v^{+}, v_k, v^{-}) = \max\bigl(0,\; margin - f_v(v^{+})\cdot f_v(v_k) + f_v(v^{+})\cdot f_v(v^{-})\bigr) \quad (11) \]

Minimizing this term preserves the neighborhood structure within the modality, pulling image mappings with similar semantics closer together while pushing image mappings with different semantics in the embedding space away from each other.
Total training objective
The overall training objective function of DSPEA integrates the cross-modal ranking constraint in equation (9), the neighborhood cross-modal ranking constraint in equation (10), and the neighborhood structure preservation constraint in equation (11). Thus, we have the following overall loss function:

\[ \mathcal{L} = \sum_{(s, v^{+}, v_k, v^{-}) \in Q} \Bigl( \mathcal{L}_{CR}(s, v^{+}, v^{-}) + \mathcal{L}_{NCR}(s, v_k, v^{-}) + \mathcal{L}_{NSP}(v^{+}, v_k, v^{-}) \Bigr) \quad (12) \]

where Q is the set of quadruples, each consisting of a query set s, a clicked image v+, a related neighborhood image v_k, and an irrelevant image v-. In the training phase, to optimize the overall objective in equation (12), we place a ranking loss layer on top of the semantic embedding layer and the visual embedding layer. The ranking loss layer has no parameters. During learning, the ranking loss layer evaluates how the model violates the cross-modal ranking constraints and the neighborhood structure preservation constraint, and back-propagates the gradients to the lower layers, so that the lower layers can adjust their parameters to minimize the overall loss.
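The three hinge-style constraints of equations (9) to (11) and the overall objective of equation (12) can be sketched as follows with plain NumPy dot products over already-computed embeddings; the equal weighting of the three terms is an assumption, and the margin value 0.5 is taken from the parameter settings described later.

```python
import numpy as np

MARGIN = 0.5

def hinge(pos_sim, neg_sim, margin=MARGIN):
    """Margin ranking loss: penalize when the positive pair is not at least
    `margin` more similar than the negative pair."""
    return max(0.0, margin - pos_sim + neg_sim)

def dspea_loss(q_s, v_pos, v_nbr, v_neg):
    """Overall loss for one quadruple (query set, clicked image,
    neighborhood image, irrelevant image), all given as embeddings."""
    l_cr  = hinge(q_s @ v_pos,  q_s @ v_neg)     # eq. (9)  cross-modal ranking
    l_ncr = hinge(q_s @ v_nbr,  q_s @ v_neg)     # eq. (10) neighborhood cross-modal ranking
    l_nsp = hinge(v_pos @ v_nbr, v_pos @ v_neg)  # eq. (11) neighborhood structure preservation
    return l_cr + l_ncr + l_nsp                  # eq. (12), summed over all quadruples in practice

rng = np.random.default_rng(2)
q_s, v_pos, v_nbr, v_neg = (rng.normal(size=80) for _ in range(4))
print(dspea_loss(q_s, v_pos, v_nbr, v_neg))
```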
In step S108, image retrieval is performed based on the keyword.
By optimizing the overall architecture, we obtain the semantic and visual embeddings defined in equations (2) and (8), respectively. Next, given a test query-image pair (q, v), the relevance is computed as the inner product of the query mapping and the image representation in the embedding space:

\[ r(q, v) = f_q(q) \cdot f_v(v) \quad (13) \]
the result of equation (13) reflects the relevance of a given image in response to a query, with higher values indicating higher relevance. Thus, given a text query, an ordered list of response images is generated by ordering the values of the query image pairs.
For the above image retrieval method, we have conducted experiments using the Clickture dataset and evaluated our method in keyword-based image search.
The data in the Clickture dataset were sampled from a one-year click log of a commercial image search engine and consist of two subsets, a training set and a development (dev) set. Specifically, the Clickture training set includes 23.1 million triads ⟨q, v, c⟩ from the click log, where each triad indicates that image v was clicked a total of c times in the search results returned for the issued query q. There are 11.7 million distinct queries and 1 million distinct images in the training set. FIG. 5 is an example of the click dataset, each row listing the click counts of images responsive to the query displayed in the first row. As shown in FIG. 5, 4 example queries are randomly selected, along with their images and the numbers of clicks received in the training set. It is easy to see that images with high click counts are semantically more relevant to the issued queries than images with low click counts. The development set includes 79,926 query-image pairs generated from 1,000 distinct queries. Each query-image pair is manually labeled on a three-level relevance scale: Excellent, Good, and Bad. The development set originates from the MSR-Bing image retrieval challenge 2013/2014. All data and partitions are officially released by the owner of the Clickture dataset. Following the official evaluation protocol of the Clickture dataset, we treat the development set as the test set and report the performance of the model of the present invention on it.
In our evaluation, we estimate the similarity of each query-image pair in the development set, and then, for each query, we rank the images according to their similarity to the query.
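For reference, the following is a small sketch of the NDCG@k metric used in this ranking-based evaluation; the gain values assigned to the three relevance levels are an assumption for illustration, since the challenge defines its own gain mapping.

```python
import math

def ndcg_at_k(relevances, k=25):
    """NDCG@k for one query given the relevance gains of the ranked images.
    `relevances` lists the graded relevance of the returned images in ranked
    order; the ideal ordering is obtained by sorting them."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Assumed gain mapping for the three-level labels (illustrative only).
GAIN = {"Excellent": 3, "Good": 1, "Bad": 0}
ranked_labels = ["Good", "Excellent", "Bad", "Good"]
print(round(ndcg_at_k([GAIN[l] for l in ranked_labels], k=25), 4))
```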
Parameter setting
For fair comparison, we use the top 50,000 most frequent words as the vocabulary for generating the TF features of queries, and the output feature map of the conv5 layer in AlexNet as the image representation in DSPEA. Further, the feature map of the res5c layer in ResNet-152, one of the most advanced architectures, is also used in DSPEA to study the impact of different image features on retrieval performance. We set the hidden layer size d_a of each attention layer to 256 and the margin of the ranking loss layer to 0.5. We apply M = 3 independent attention layers in the multi-head attention mechanism. The dimension d_e of the embedding space is chosen from {40, 60, 80, 100}. We implement the DSPEA model mainly on Caffe, a widely adopted deep learning platform. We use the entire training set of 23.1 million (query, image, click count) triads to train the DSPEA model. In the training phase, with an initial learning rate of 0.01 and a mini-batch size of 512, the loss of our DSPEA model decreases to 23% of its initial value and reaches a reasonable level after 60K iterations (about 30 epochs). In terms of time cost, it takes 30 hours to train DSPEA on one NVIDIA Tesla V100 GPU (16 GB).
Evaluation index and comparison method
We apply the official evaluation criterion, using Normalized Discounted Cumulative Gain (NDCG) as the performance metric. We compare the following methods:
Canonical Correlation Analysis (CCA) finds two linear mappings that convert queries and images into one shared latent subspace in which the correlation of the two modalities is maximized. Classical CCA can be further extended with kernelization and deep learning techniques, respectively. The former learns kernelized non-linear projections in CCA, and the latter builds the mapping model through non-linear transformations of multiple stacked layers. We refer to these two CCA extensions as Kernel CCA (KCCA) and Deep CCA (DCCA).
Cross-modal learning (CCL) based on click data constructs a potential subspace by minimizing the cross-modal distance between the query map and the images weighted according to the number of clicks, and preserving their inherent structure in the original feature space.
Ranking Canonical Correlation Analysis (RCCA) learns the query-image similarity in two steps: (1) the two linear mappings of queries and images are first trained with CCA; (2) the two mappings are then further adjusted to learn a bilinear similarity function while preserving the preference relations implicit in the click data.
Deep visual-semantic embedding (DeViSE) optimizes the projection layers and the similarity metric so that the dot-product similarity between an image mapping and its query mapping in the embedding space is higher than that between the image mapping and other randomly selected query mappings.
Bag-of-words based deep neural networks (BoWDNN) project images to a bag-of-words query space by training a deep neural network using cosine similarity between the query and the images.
Click-based deep visual-semantic embedding (C-DVSE) trains an image embedding layer and a query embedding layer to minimize the distance between the mappings of click-linked images and query sets.
Click-based deep structure preserving embedding with visual attention (DSPEA) is the method proposed by the present invention. DSPEA and DSPEA(res5c) denote DSPEA built on the feature map of the conv5 layer in AlexNet and on the feature map of the res5c layer in ResNet-152, respectively. Furthermore, three other settings of DSPEA are examined, named DSPE(fc8), DSPE(conv5), and DSPEA-: DSPE(fc8) and DSPE(conv5) use the output of the fc8 layer of AlexNet or the average-fused convolutional descriptors of the conv5 layer as the image representation, respectively, without the attention mechanism, while DSPEA- employs a single query rather than a query set as the semantic representation.
We illustrate the convergence of the training algorithm on the Clickture dataset by plotting the training loss curve and the validation loss curve. Note that the loss curves here are generated with an embedding space dimension of 80; the curves for other dimensions show similar trends. Fig. 6 shows the variation of the training loss and the validation loss as the number of training iterations increases. The dimension of the embedding space here is 80. As shown in Fig. 6, both the training loss and the validation loss decrease as the training iterations increase, as expected. Furthermore, after a number of iterations (60 × 10³ in this experiment), the fluctuation of both losses becomes very smooth.
Table 1 shows the NDCG performance of image search for the 13 runs on the Clickture development set, averaged over 1,000 queries. Notably, BoWDNN takes the query space as the common space and maps images into a query space of dimension 50,000, while for the other runs the performance here is reported on an 80-dimensional embedding space. Overall, our proposed DSPEA consistently outperforms the other runs at different NDCG depths. In particular, the NDCG@25 of DSPEA reaches 51.92%, a relative improvement of 1.19% over the best competitor DCCA, which is generally regarded as a significant improvement on the Clickture dataset. By upgrading the feature map from the conv5 layer of AlexNet to the res5c layer of ResNet-152, the NDCG@25 of DSPEA(res5c) achieves 52.21%, the highest performance reported so far. The NDCG@25 can be further improved to 55.63% by performing a random walk to re-rank the results of DSPEA(res5c). DSPEA additionally exploits the relative relationships in cross-modal embedding learning and integrates structure preservation, resulting in improved performance over CCA and KCCA. There are performance gaps among RCCA, CCL, and DSPEA. Although these three runs all involve structure preservation or preference relations, they employ different strategies to learn the projections: RCCA optimizes only the cross-modal ranking constraint, CCL considers only the intra-modal structure preservation constraint, and our DSPEA considers both constraints jointly. The results demonstrate the advantage of learning the embedding by preserving both the cross-modal similarity relations and the intra-modal neighborhood structure underlying the click data.
TABLE 1
Method NDCG@5 NDCG@10 NDCG@15 NDCG@20 NDCG@25
CCA 59.55% 58.48% 55.38% 52.85% 50.51%
KCCA 59.75% 58.55% 55.45% 52.87% 50.60%
DCCA 61.34% 59.91% 56.45% 53.64% 51.31%
CCL 59.85% 58.65% 55.55% 52.89% 50.63%
RCCA 60.75% 59.44% 56.25% 53.53% 51.12%
DeViSE 60.46% 59.08% 55.78% 53.27% 51.10%
BoWDNN 60.68% 59.00% 55.83% 53.14% 50.89%
C-DVSE 60.95% 59.41% 56.21% 53.58% 51.30%
DSPE(fc8) 61.84% 60.06% 56.75% 53.97% 51.54%
DSPE(conv5) 61.13% 59.50% 56.12% 53.34% 51.20%
DSPEA- 61.39% 59.65% 56.36% 53.69% 51.33%
DSPEA 62.72% 60.29% 57.01% 54.27% 51.92%
DSPEA(res5c) 63.03% 60.72% 57.28% 54.48% 52.21%
The performance of DSPEA is superior to that of DCCA, DeViSE, BoWDNN, and C-DVSE. Although these five methods all use deep neural networks, they learn the embedding space in different ways. BoWDNN directly uses the original query space as the embedding space, while DSPEA learns a latent, common visual-semantic embedding space. Our experimental results show that learning a common embedding space better measures query-image similarity and consistently improves performance. Further, DSPEA also benefits from exploiting the relative relationships, improving over the other two runs, DCCA and C-DVSE. Moreover, DSPEA always performs better than DeViSE, which confirms the effectiveness of introducing the structure-preserving constraint in embedding optimization.
Compared with DSPEA-, which represents each query individually, DSPEA based on query sets shows better performance. Therefore, merging the queries linked to an image into a set is an effective representation of the query modality. Although the three runs DSPE(conv5), DSPE(fc8), and DSPEA all originate from the conv5 layer, they differ in nature when generating the image representation. DSPE(fc8) flattens all the kernel maps of the conv5 layer onto the neurons of a fully connected layer, while DSPE(conv5) and DSPEA are obtained by fusing the local descriptors of the conv5 layer by averaging or by a linear fusion weighted by their attention probabilities, respectively. As shown by our experimental results, DSPE(fc8) consistently achieves better performance than DSPE(conv5), but is still inferior to DSPEA, again validating our scheme. Furthermore, we used a statistical significance test, the randomization test, to verify that the performance improvement of our DSPEA over the other methods is not accidental. The number of randomization iterations is set to 100,000 and the significance level to 0.05. We found that DSPEA is significantly better than the other methods.
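The randomization (permutation) test mentioned above can be sketched as follows, assuming paired per-query scores for the two systems being compared; the 100,000 iterations and 0.05 significance level follow the setting above, while the helper names and toy scores are illustrative.

```python
import numpy as np

def randomization_test(scores_a, scores_b, iterations=100_000, seed=0):
    """Two-sided permutation test on paired per-query scores: randomly swap
    the two systems' scores per query and count how often the absolute mean
    difference is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    observed = abs((a - b).mean())
    diffs = a - b
    count = 0
    for _ in range(iterations):
        signs = rng.choice([1.0, -1.0], size=diffs.size)   # random label swaps
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / iterations                               # p-value

# Toy example: compare per-query NDCG@25 of two runs over 10 queries.
p = randomization_test([0.52, 0.60, 0.48, 0.55, 0.51, 0.58, 0.49, 0.62, 0.53, 0.57],
                       [0.50, 0.58, 0.47, 0.53, 0.50, 0.55, 0.48, 0.60, 0.52, 0.55],
                       iterations=10_000)
print("significant at 0.05:", p < 0.05)
```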
Fig. 7 shows the image search results of ten different methods according to an example, for the queries (a) "customized mini rocket in pocket" and (b) "free time clip art"; the relevance grade is shown in the lower right corner of each picture. The figure gives the top-ten image search results of the different methods for these two queries. We can see that the DSPEA proposed by the present invention achieves the most satisfactory ranking results. In particular, for the query "customized mini rocket", DSPEA retrieves nine relevant pictures among the top ten results, which is better than the other methods. In general, DSPEA can be expected to achieve the best performance as long as the semantics of the query are specific. For queries that convey ambiguous semantics, the search results of DSPEA are inevitably affected, because such queries may relate to images with many different visual appearances, so the semantic relevance between the query and the images becomes very weak. Fig. 8 shows the image search results of the ten methods according to another example. Taking the query "interesting icon" as an example, our DSPEA retrieves only seven relevant pictures among the top ten, as shown in Fig. 8, while C-DVSE and DeViSE obtain a better ranked list.
Query representation
One common issue in keyword-based image search is how to represent the text query. In the previous experiments we used the TF representation for a fair comparison with the state-of-the-art methods. In addition, we ran experiments testing the search performance of other alternatives. The results show that using different query representations does not lead to significant differences: taking NDCG@25 as an example, performance fluctuates by only about 0.06% when using TF, TF-IDF and BM25 as the query representation. This in effect eases the choice of the query representation.
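As an illustration only, a click-weighted bag-of-words query-set representation with TF or TF-IDF weighting could look like the following sketch; the vocabulary, tokenization and weighting details are assumptions and not the exact preprocessing used in the experiments.

```python
import math
from collections import Counter

def tf_vector(queries_with_clicks, vocab):
    """Click-weighted term-frequency vector for a query set.

    queries_with_clicks: list of (query_string, click_count) pairs.
    """
    counts = Counter()
    for query, clicks in queries_with_clicks:
        for term in query.lower().split():
            counts[term] += clicks
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in vocab]

def tfidf_vector(tf, doc_freq, n_docs):
    """Scale a TF vector by inverse document frequency."""
    return [w * math.log(n_docs / (1 + df)) for w, df in zip(tf, doc_freq)]

vocab = ["free", "time", "clip", "art", "rocket"]
qs = [("free time clip art", 12), ("clip art", 3)]
print(tf_vector(qs, vocab))
```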
Attention mechanism
Next, we compare the multi-head attention used in DSPEA with single-layer attention (DSPEA-SL) and stacked attention (DSPEA-ST). The single-layer attention mechanism uses a single attention network for attention generation, while the stacked attention mechanism uses a multi-layer attention network to locate the image regions relevant to a query through multi-step reasoning. The NDCG@25 values obtained by DSPEA-SL, DSPEA-ST and DSPEA on the Clickture dataset are 51.81%, 51.86% and 51.92%, respectively, and the performance trends at the other NDCG depths are similar to NDCG@25. The results essentially show that jointly learning multiple attention functions in parallel is superior to learning only a single attention function, whether in a single layer or in a stack.
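A minimal sketch of the multi-head attention pooling compared here, written with PyTorch and assuming a grid of conv5/res5c local descriptors as input. The softmax over regions, the per-head weighted fusion and the averaging of the M heads follow the description in this document; the layer sizes and the exact parameterization of each attention layer are assumptions, and whether each layer is additionally conditioned on the query is not specified here, so the sketch uses image features alone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentionPooling(nn.Module):
    """Pool R local descriptors into one image vector with M parallel attention heads."""

    def __init__(self, feat_dim=2048, hidden=512, heads=3):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
             for _ in range(heads)])

    def forward(self, local_feats):
        # local_feats: (B, R, feat_dim), e.g. R = 13 * 13 regions of the feature map
        pooled = []
        for head in self.heads:
            scores = head(local_feats).squeeze(-1)                          # (B, R) attention scores
            attn = F.softmax(scores, dim=-1)                                # attention distribution
            pooled.append((attn.unsqueeze(-1) * local_feats).sum(dim=1))    # aggregated representation
        # Average the M aggregated representations into the output image representation.
        return torch.stack(pooled).mean(dim=0)

pool = MultiHeadAttentionPooling()
img_vec = pool(torch.randn(2, 169, 2048))
print(img_vec.shape)  # torch.Size([2, 2048])
```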
Visualization of visual attention
We further analyzed how the M = 3 independent attention layers in the multi-head attention mechanism determine the image regions relevant to the similarity measure between a query and an image. FIG. 9 illustrates four image examples using the multi-head attention mechanism according to an example. In each example, the five images from left to right are the original image, the attention maps of the three independent attention layers, and the average attention map of the multi-head attention, where brightness represents the strength of focus. Note that the original image is 224 x 224 and the attention map is 13 x 13; we upsample the attention distribution and apply a Gaussian filter to make it the same size as the original image. From these examples it is easy to see that the attention of the three independent layers is spread over the different query-relevant objects/scenes in the image, while the average attention of the multi-head attention covers all regions relevant to the query. Taking Fig. 9(a) as an example, the three independent attention layers attend to the air/background, Michael Jordan and the basket respectively, whereas the overall multi-head attention is focused on all regions relevant to answering the query "michael jordan in the air".
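The visualization step described above (upsampling a 13 x 13 attention map to 224 x 224 and smoothing it with a Gaussian filter) can be sketched as follows; the interpolation order and the filter sigma are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def upsample_attention(attn_13x13, out_size=224, sigma=8.0):
    """Upsample a coarse attention map to image size and smooth it for overlay."""
    attn = np.asarray(attn_13x13, dtype=np.float64)
    scale = out_size / attn.shape[0]
    big = zoom(attn, scale, order=1)          # bilinear upsampling to 224 x 224
    big = gaussian_filter(big, sigma=sigma)   # Gaussian smoothing
    big -= big.min()
    return big / (big.max() + 1e-12)          # normalize to [0, 1] for display

heat = upsample_attention(np.random.rand(13, 13))
print(heat.shape)  # (224, 224)
```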
Influence of embedding dimension
Fig. 10 shows the NDCG performance curves of different methods for different embedding dimensions. We plot the performance curves of eight methods at different NDCG depths (i.e., NDCG@5, NDCG@10, NDCG@15, NDCG@20 and NDCG@25) to explore the impact of different embedding dimensions (i.e., 40, 60, 80 and 100). Note that BoWDNN directly treats the original query space as the embedding space, so the dimension of its embedding space is fixed and the method is omitted here. Overall, for every dimension of the embedding space there is a clear performance gap between the other seven methods and our proposed DSPEA. In particular, DSPEA peaks at all NDCG depths when the embedding dimension is 100. Meanwhile, the NDCG performance of DSPEA fluctuates only slightly with the dimension of the latent subspace at every depth, which in effect eases the choice of the embedding dimension.
Effects of individual constraints
Three constraints, namely the cross-modal ranking constraint (CR), the neighborhood cross-modal ranking constraint (NCR) and the intra-modal neighborhood structure preservation constraint (NSP), are jointly optimized in our DSPEA. Here we investigate the contribution of each constraint. Fig. 11 shows the NDCG@25 performance improvement of different constraint combinations over using the cross-modal ranking (CR) constraint alone, across different dimensions of the embedding space. As shown in Fig. 11, all three combinations (i.e., CR+NCR, CR+NSP and CR+NCR+NSP) improve over CR alone. Across the different dimensions of the embedding space, the results consistently show that learning with all three constraints yields a larger performance gain than learning with only two. Furthermore, CR+NSP also works better than CR+NCR, which can be understood as CR and NSP acting from the cross-modal and intra-modal perspectives respectively, so the two are complementary.
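A hedged sketch of how the three constraints could be combined into one training loss, following the structure described here and in the embodiments below: a cross-modal ranking term on (query set, clicked image, non-clicked image) triplets, a neighborhood cross-modal ranking term that substitutes a semantically similar third image, and an intra-modal neighborhood structure preservation term. The hinge form, the margin and the weights are assumptions, not the exact objective.

```python
import torch
import torch.nn.functional as F

def dspea_loss(q, x_pos, x_neg, x_nbr, margin=0.1, lam_ncr=1.0, lam_nsp=1.0):
    """Combined CR + NCR + NSP loss on embedded query sets and images.

    q, x_pos, x_neg, x_nbr: (B, d) embeddings of the query set, a clicked image,
    a non-clicked image and a semantic neighbor of the clicked image.
    Similarities are inner products in the shared embedding space.
    """
    sim = lambda a, b: (a * b).sum(dim=-1)
    # CR: the clicked image should score higher than the non-clicked one for the query set.
    cr = F.relu(margin - sim(q, x_pos) + sim(q, x_neg))
    # NCR: a semantic neighbor of the clicked image should also outscore the non-clicked one.
    ncr = F.relu(margin - sim(q, x_nbr) + sim(q, x_neg))
    # NSP: within the image modality, keep the clicked image closer to its neighbor
    # than to the non-clicked image (structure-preserving regularization).
    nsp = F.relu(margin - sim(x_pos, x_nbr) + sim(x_pos, x_neg))
    return (cr + lam_ncr * ncr + lam_nsp * nsp).mean()

loss = dspea_loss(*[torch.randn(4, 80) for _ in range(4)])
print(float(loss))
```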
Run time
For online search, our DSPEA needs about 1 minute on an ordinary personal computer (an Intel Core i7-4770 3.40 GHz CPU with 16 GB of RAM) to complete the similarity measurement for all 79,926 query-image pairs in the development set. In other words, computing the similarity of a single query-image pair takes less than 1 millisecond (about 0.75 ms on average), which is fast enough for instant response.
The embodiment of the invention provides a click-based deep structure-preserving embedding model with visual attention (DSPEA) and studies cross-modal learning of query-image similarity from click data. In particular, we optimize the whole architecture of the embedding model while preserving the cross-modal relative ranking relations and the intra-view neighborhood structure. To better represent the image and query spaces, an attention mechanism is further introduced into the CNN to locate the image regions relevant to a query, and queries are merged into query sets to alleviate the sparsity problem. Our scheme and analysis are validated by experiments on the Clickture dataset. The performance improvement over other cross-modal embedding techniques is evident and, more remarkably, our DSPEA achieves the highest performance reported to date on the Clickture dataset.
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. The computer program, when executed by the CPU, performs the functions defined by the method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Fig. 12 is a block diagram illustrating an image retrieval apparatus according to an exemplary embodiment.
Referring to fig. 12, the image retrieval apparatus 20 includes: semantic embedding module 202, visual embedding module 204, target training module 206, and image retrieval module 208.
The semantic embedding module 202 is configured to perform semantic embedding based on the acquired click data of the user's image retrieval.
The visual embedding module 204 is configured to perform visual embedding based on an attention mechanism.
The target training module 206 is configured to project the query set and the corresponding image into a low-dimensional embedding space through the click-based semantic embedding and the attention-integrated visual embedding, so as to perform target training.
The image retrieval module 208 is used for performing image retrieval based on the keywords.
In some embodiments, the semantic embedding module 202 includes a bipartite graph construction unit, a query set merging unit and a semantic embedding unit. The bipartite graph construction unit is used for constructing a bipartite graph based on the click data, the bipartite graph comprising a plurality of images and at least one query for each image. The query set merging unit is used for merging the at least one query of each image into one query set. The semantic embedding unit is used for performing the following operations on each query set: merging the at least one query, weighted by its click count in the query set, and learning semantic embedding based on the cumulative representation of the query set to generate a click-based query set representation of the query set; and applying a single-layer neural network to the click-based query set representation to generate the query set representation of the query set in the embedding space.
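A minimal sketch of the bipartite-graph construction and query-set merging performed by these units, assuming click logs given as (query, image_id, clicks) triples; the data layout and field names are hypothetical.

```python
from collections import defaultdict

def build_query_sets(click_log):
    """Group the queries clicked for each image into one click-weighted query set.

    click_log: iterable of (query, image_id, clicks) triples from a click log.
    Returns {image_id: [(query, clicks), ...]}, i.e. one side of the
    query-image bipartite graph with click counts as edge weights.
    """
    query_sets = defaultdict(lambda: defaultdict(int))
    for query, image_id, clicks in click_log:
        query_sets[image_id][query] += clicks  # accumulate the edge weight
    return {img: sorted(qs.items(), key=lambda kv: -kv[1])
            for img, qs in query_sets.items()}

log = [("michael jordan dunk", "img_1", 42),
       ("jordan in the air", "img_1", 7),
       ("free time clip art", "img_2", 3)]
print(build_query_sets(log)["img_1"])
```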
In some embodiments, the visual embedding module 204 includes a feature map determination unit, an aggregated image representation determination unit, an output image representation determination unit and an image embedding unit. The feature map determination unit is used for determining, for each image, an overall image feature map of the image based on the output feature map of a convolutional layer of a deep convolutional neural network, the overall image feature map comprising local descriptors of a plurality of regions. The aggregated image representation determination unit is used for merging M attention layers into the deep convolutional neural network and performing the following operations: inputting the overall image feature map into each of the M attention layers to generate the attention distribution of the image for that attention layer; and weighting and combining the local descriptors of the plurality of regions based on the attention distribution of each attention layer to generate M aggregated image representations of the image. The output image representation determination unit is configured to obtain the output image representation of the image by averaging the M aggregated image representations of the image. The image embedding unit is used for applying a visual embedding layer of the deep convolutional neural network to embed the output image representation into the embedding space, so as to obtain the image representation of the image in the embedding space.
In some embodiments, the target training module 206 includes a loss function determination unit configured to determine, for each query set, a loss function for target training by: determining a cross-modal ranking constraint based on a margin ranking loss according to the query set, a first image clicked by any query in the query set and a second image not clicked by any query in the query set; determining a neighborhood cross-modal ranking constraint based on a margin ranking loss according to the query set, a third image semantically similar to the first image, and the second image; determining a neighborhood structure preservation constraint based on structure-preserving regularization according to the first image, the second image and the third image; and determining the loss function according to the cross-modal ranking constraint, the neighborhood cross-modal ranking constraint and the neighborhood structure preservation constraint.
In some embodiments, the target training module 206 further includes a target training unit for performing target training on the query set and the images based on the loss function so as to minimize the total loss.
In some embodiments, the image retrieval module 208 includes an image retrieval unit for retrieving images in the embedding space according to a text query provided by the keyword, based on the ranking of the inner products between the text query and the images in the embedding space.
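The retrieval step of this unit reduces to ranking images by the inner product between the query embedding and the image embeddings. The following sketch illustrates this, where the 80-dimensional embeddings and the image ids are hypothetical placeholders standing in for the trained embedding outputs.

```python
import numpy as np

def retrieve_top_k(query_vec, image_matrix, image_ids, k=10):
    """Rank images by inner product with the query embedding and return the top-k ids."""
    scores = image_matrix @ query_vec            # (N,) inner products in the embedding space
    order = np.argsort(-scores)[:k]              # highest similarity first
    return [(image_ids[i], float(scores[i])) for i in order]

# Hypothetical 80-dimensional embedding space with 5 indexed images.
rng = np.random.default_rng(0)
image_matrix = rng.standard_normal((5, 80))
print(retrieve_top_k(rng.standard_normal(80), image_matrix,
                     [f"img_{i}" for i in range(5)], k=3))
```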
It is noted that the block diagrams shown in the above figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
FIG. 13 is a block diagram illustrating a computer system in accordance with an exemplary embodiment. It should be noted that the computer system shown in fig. 13 is only an example, and should not bring any limitation to the function and the scope of the application of the embodiment of the present invention.
As shown in fig. 13, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 808 including a hard disk and the like; and a communication portion 809 including a network interface card such as a LAN card, a modem, or the like. The communication portion 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage portion 808 as necessary.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a transmitting unit, an obtaining unit, a determining unit, and a first processing unit. The names of these units do not in some cases constitute a limitation to the unit itself, and for example, the sending unit may also be described as a "unit sending a picture acquisition request to a connected server".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may be separate and not incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform the following:
performing semantic embedding based on the acquired click data for image retrieval of the user;
performing visual embedding based on an attention mechanism;
projecting the query set and the corresponding images into a low-dimensional embedding space through the semantic embedding and the visual embedding, and performing target training; and
and performing image retrieval based on the keywords.
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. An image retrieval method, comprising:
performing semantic embedding based on the acquired click data for image retrieval of the user;
performing visual embedding based on an attention mechanism;
projecting the query set and the corresponding images into a low-dimensional embedding space through the semantic embedding and the visual embedding, and performing target training; and
and performing image retrieval based on the keywords.
2. The method of claim 1, wherein performing semantic embedding based on the obtained click data for image retrieval by the user comprises:
constructing a bipartite graph based on the click data, the bipartite graph including a plurality of the images and at least one query for each of the images;
merging the at least one query of each of the images into one of the query sets, respectively; and
for each of the query sets, performing the following operations:
merging the at least one query weighted by the number of clicks of the query in the query set and learning semantic embedding based on a cumulative representation form of the query set to generate click-based query set representations of the query set, respectively; and
and applying a single-layer neural network to generate a query set representation of the query set in the embedding space for the click-based query set representation of the query set.
3. The method of claim 2, wherein performing visual embedding based on an attention mechanism comprises: for each of the images, the following operations are performed:
determining an overall image feature map for the image based on an output feature map of convolutional layers of a deep convolutional neural network, the overall image feature map comprising local descriptors for a plurality of regions;
combining the M attention layers into the deep convolutional neural network, and performing the following operations:
inputting the overall image features into the M attention layers respectively to generate attention distribution of the images of the attention layers; and
weighting and combining the local descriptors of the plurality of regions based on the attention distribution of each of the attention layers, respectively, to generate M aggregated image representations of the image;
obtaining an output image representation of the image by averaging the M aggregated image representations of the image; and
applying a visual embedding layer of the deep convolutional neural network to embed the output image representation into the embedding space to obtain an image representation of the image in the embedding space.
4. The method of claim 3, wherein projecting a query set and corresponding images into a low-dimensional embedding space by the semantic embedding and the visual embedding, performing target training comprises: for each of the query sets, determining a loss function for target training, comprising:
determining a cross-modal ranking constraint based on a margin ranking loss according to the query set, a first image clicked by any query in the query set and a second image not clicked by any query in the query set;
determining a neighborhood cross-modal ranking constraint based on a margin ranking loss according to the query set, a third image semantically similar to the first image, and the second image;
determining a neighborhood structure preservation constraint based on structure-preserving regularization according to the first image, the second image and the third image; and
determining the loss function according to the cross-modal ranking constraint, the neighborhood cross-modal ranking constraint and the neighborhood structure preservation constraint.
5. The method of claim 4, wherein projecting a query set and corresponding images into a low-dimensional embedding space by the semantic embedding and the visual embedding, performing target training further comprises: target training the query set and the image based on the loss function to minimize an overall loss.
6. The method of claim 5, wherein performing image retrieval based on the keywords comprises:
performing image retrieval in the embedding space according to the text query provided by the keyword, based on the ranking of the inner products between the text query and the images in the embedding space.
7. The method of any one of claims 1-6, wherein the embedding space is a low-dimensional embedding space for learning similarity between the query set and the image, and wherein the similarity between the query set and the image is directly measured by an inner product mapped in the embedding space.
8. An image retrieval apparatus, comprising:
the semantic embedding module is used for carrying out semantic embedding on the basis of the acquired click data for image retrieval of the user;
the visual embedding module is used for carrying out visual embedding based on an attention mechanism;
the target training module is used for projecting the query set and the corresponding images into a low-dimensional embedding space through the semantic embedding and the visual embedding so as to carry out target training; and
and the image retrieval module is used for carrying out image retrieval based on the keywords.
9. A computer device, comprising: memory, processor and executable instructions stored in the memory and executable in the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the executable instructions.
10. A computer-readable storage medium having stored thereon computer-executable instructions, which when executed by a processor, implement the method of any one of claims 1-7.
CN201910452983.5A 2019-05-20 2019-05-28 Image retrieval method, device, equipment and readable storage medium Active CN111753116B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019104202700 2019-05-20
CN201910420270 2019-05-20

Publications (2)

Publication Number Publication Date
CN111753116A true CN111753116A (en) 2020-10-09
CN111753116B CN111753116B (en) 2024-05-24

Family

ID=72672854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910452983.5A Active CN111753116B (en) 2019-05-20 2019-05-28 Image retrieval method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111753116B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199531A (en) * 2020-11-05 2021-01-08 广州杰赛科技股份有限公司 Cross-modal retrieval method and device based on Hash algorithm and neighborhood map
CN112883229A (en) * 2021-03-09 2021-06-01 中国科学院信息工程研究所 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
CN114003758A (en) * 2021-12-30 2022-02-01 航天宏康智能科技(北京)有限公司 Training method and device of image retrieval model and retrieval method and device
JP7100737B1 (en) 2021-04-01 2022-07-13 日本電信電話株式会社 Learning equipment, learning methods and learning programs
CN114926828A (en) * 2022-05-17 2022-08-19 北京百度网讯科技有限公司 Scene text recognition method and device, electronic equipment and storage medium
CN115392389A (en) * 2022-09-01 2022-11-25 北京百度网讯科技有限公司 Cross-modal information matching and processing method and device, electronic equipment and storage medium
WO2023012012A1 (en) * 2021-08-05 2023-02-09 Robert Bosch Gmbh Device and method for selecting a digital image
CN116028663A (en) * 2023-03-29 2023-04-28 深圳原世界科技有限公司 Three-dimensional data engine platform

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1967536A (en) * 2006-11-16 2007-05-23 华中科技大学 Region based multiple features Integration and multiple-stage feedback latent semantic image retrieval method
CN102880729A (en) * 2012-11-02 2013-01-16 深圳市宜搜科技发展有限公司 Figure image retrieval method and device based on human face detection and recognition
CN104778281A (en) * 2015-05-06 2015-07-15 苏州搜客信息技术有限公司 Image index parallel construction method based on community analysis
US20180075161A1 (en) * 2016-09-09 2018-03-15 University Of Southern California Extensible automatic query language generator for semantic data
CN108647691A (en) * 2018-03-12 2018-10-12 杭州电子科技大学 A kind of image classification method based on click feature prediction

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199531B (en) * 2020-11-05 2024-05-17 广州杰赛科技股份有限公司 Cross-modal retrieval method and device based on hash algorithm and neighborhood graph
CN112199531A (en) * 2020-11-05 2021-01-08 广州杰赛科技股份有限公司 Cross-modal retrieval method and device based on Hash algorithm and neighborhood map
CN112883229A (en) * 2021-03-09 2021-06-01 中国科学院信息工程研究所 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
JP2022158736A (en) * 2021-04-01 2022-10-17 日本電信電話株式会社 Learning device, learning method, and learning program
JP7100737B1 (en) 2021-04-01 2022-07-13 日本電信電話株式会社 Learning equipment, learning methods and learning programs
WO2023012012A1 (en) * 2021-08-05 2023-02-09 Robert Bosch Gmbh Device and method for selecting a digital image
CN114003758B (en) * 2021-12-30 2022-03-08 航天宏康智能科技(北京)有限公司 Training method and device of image retrieval model and retrieval method and device
CN114003758A (en) * 2021-12-30 2022-02-01 航天宏康智能科技(北京)有限公司 Training method and device of image retrieval model and retrieval method and device
CN114926828A (en) * 2022-05-17 2022-08-19 北京百度网讯科技有限公司 Scene text recognition method and device, electronic equipment and storage medium
CN114926828B (en) * 2022-05-17 2023-02-24 北京百度网讯科技有限公司 Scene text recognition method and device, electronic equipment and storage medium
CN115392389A (en) * 2022-09-01 2022-11-25 北京百度网讯科技有限公司 Cross-modal information matching and processing method and device, electronic equipment and storage medium
CN115392389B (en) * 2022-09-01 2023-08-29 北京百度网讯科技有限公司 Cross-modal information matching and processing method and device, electronic equipment and storage medium
CN116028663A (en) * 2023-03-29 2023-04-28 深圳原世界科技有限公司 Three-dimensional data engine platform
CN116028663B (en) * 2023-03-29 2023-06-20 深圳原世界科技有限公司 Three-dimensional data engine platform

Also Published As

Publication number Publication date
CN111753116B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN111753116B (en) Image retrieval method, device, equipment and readable storage medium
US10909459B2 (en) Content embedding using deep metric learning algorithms
Wu et al. Learning to make better mistakes: Semantics-aware visual food recognition
Yu et al. Learning to rank using user clicks and visual features for image retrieval
Yin et al. Incomplete multi-view clustering via subspace learning
Xing et al. A hierarchical attention model for rating prediction by leveraging user and product reviews
US20170039198A1 (en) Visual interactive search, scalable bandit-based visual interactive search and ranking for visual interactive search
Baral et al. Reel: Review aware explanation of location recommendation
Shang et al. Dual space latent representation learning for unsupervised feature selection
US20060004753A1 (en) System and method for document analysis, processing and information extraction
CN107066589B (en) Entity semantics and word frequency ordering method and device based on comprehensive knowledge
CN107209762A (en) Visual interactive formula is searched for
Liu et al. Social embedding image distance learning
Liu et al. Learning socially embedded visual representation from scratch
Zheng et al. MMDF-LDA: An improved Multi-Modal Latent Dirichlet Allocation model for social image annotation
Luo et al. Orthogonally constrained matrix factorization for robust unsupervised feature selection with local preserving
He et al. A framework of query expansion for image retrieval based on knowledge base and concept similarity
Zhang et al. Dual-constrained deep semi-supervised coupled factorization network with enriched prior
CN113918832A (en) Graph convolution collaborative filtering recommendation system based on social relationship
Peng et al. Leveraging multi-modality data to airbnb price prediction
Meng et al. Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection
Long et al. Bi-calibration networks for weakly-supervised video representation learning
Pan et al. Image search by graph-based label propagation with image representation from dnn
An et al. Enabling the interpretability of pretrained venue representations using semantic categories
CN112861882B (en) Image-text matching method and system based on frequency self-adaption

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant