WO2017168125A1

WO2017168125A1 - Sketch based search methods

Info

Publication number: WO2017168125A1
Application number: PCT/GB2017/050825
Authority: WO
Inventors: Yi-zhe SONG; Tao Xiang; Timothy HOSPEDALES
Original assignee: Queen Mary University Of London
Priority date: 2016-03-31
Filing date: 2017-03-23
Publication date: 2017-10-05

Abstract

The overall system 1of an embodiment of the invention has three main parts: a fine-grained retrieval engine 2, a selective sketch interactive module 3 and an augmenting sketch interactive module 4. An interface 5handles communication with a user. The fine-grained retrieval engine 1 is a model trained from a database of sketches and photos. It can then be used non-interactively to retrieve photos similar to an input sketch, or interactively via one of the two interactive modules 3, 4. The interactive interface which can be of a selective type or an augmenting type.

Description

SKETCH BASED SEARCH METHODS

[0001] The present invention relates to sketch-based search methods, i.e. methods in which a hand- drawn sketch is the basis for a search amongst a library of images, and in particular to fine-grained searches, that is methods which return specific matching images rather than image categories.

[0002] Most image searching methods currently known take either text or another image as the input query. If the input query is text, the search method looks for images associated with the query text, e.g. by looking in the metadata of the images or in the context around the images. Image metadata may derive from pre-classification of the searched images. If the input query is an image, then the search method looks for similar images, e.g. using pattern recognition algorithms.

[0003] Text-based image searches work well for images of specific people, places or things if the user knows the proper name for the target. If the target is well known enough then a suitably labelled image can usually be found quite easily. However, if the user does not know the proper name of the target but tries to search by a textual description, for example "red high-heeled shoes with a bow on the toe", then the results are dependent on the user's textual description highlighting the same features and using the same terms as any description applied to the searched images. Thus the accuracy of the search is likely to be dependent on the user's use of common terminology for important features of the search target.

[0004] Image-based image searches require the user to have an image similar to that being searched for, which the user may not have. In addition, image-based search methods mostly consider the whole image and therefore, if the user desires to find other images of a foreground object in the search query image, images with that foreground object in a different orientation or against a different background may not be found.

[0005] There is therefore a need for an alternative or improved method by which a user can search for specific instances of a target object, in particular which does not require the user to have detailed knowledge or another image of the target object.

[0006] An aspect of the present invention provides a method of searching for images of a target object instance, the method comprising:

receiving sketch data representing a hand-drawn sketch of the target object from a user; using a deep triplet ranking model to compare the sketch data to a gallery of images of the same object category to obtain a ranked list of images;

providing the ranked list of images to the user.

[0007] In an embodiment the deep triplet ranking model is a convolutional neural network.

[0008] In an embodiment the convolutional neural network is a Siamese network.

[0009] In an embodiment the deep triplet ranking model is a multi-task model. [0010] In an embodiment the multi-task model has been trained to learn an auxiliary task comprising an attribute prediction task and/or an attribute learning task.

[0011] In an embodiment the sketch data includes information representing the order of strokes in the sketch.

[0012] An embodiment further comprises:

receiving a user selection of an image from the ranked list and additional sketch data representing a drawing the user has performed on the selected image;

identifying the part of the object in the selected image indicated by the additional sketch data;

using a strongly-supervised deformable part-based model to compare the part of the object to corresponding parts of the images of the gallery of images to obtain an updated ranked list of images; and

providing the updated ranked list of images to the user.

[0013] An embodiment further comprises:

combining the sketch data with the additional sketch data to obtain augmented sketch data; using the deep triplet ranking model to compare the augmented sketch data to the gallery of images of objects to obtain a second ranked list of images;

using an image-domain neural network to compare the selected image to the gallery of images to obtain a third ranked list of images; and

providing the second and third ranked lists of images to the user.

[0014] In an embodiment the second and third ranked lists of images are provided to the user as a merged updated ranked list of images.

[0015] An aspect of the invention also provides a method of training a neural network to perform fine-grained sketch-based image retrieval, the method comprising:

receiving a training image gallery comprising a plurality of images of objects and attribute data relating to the objects;

receiving a training sketch gallery comprising a plurality of sketches of objects and attribute data relating to the objects;

generating a plurality of triplets using the attribute data and/or data-driven feature representation, each triplet comprising a sketch of a target object, a positive image representing an object similar to the target object and a negative image representing an object dissimilar to the target object;

training the neural network using the plurality of triplets.

[0016] In an embodiment generating a plurality of triplets includes selecting positive images and/or negative images by extracting features using a category-trained ranking model. [0017] An embodiment further comprises generating a plurality of additional sketches by modifying sketches of the training sketch gallery.

[0018] In an embodiment modifying sketches comprising selectively removing strokes from a sketch.

[0019] In an embodiment selectively removing strokes comprises randomly removing strokes with a probability based on stroke length and stroke order.

[0020] In an embodiment modifying sketches comprises deforming strokes of sketches individually.

[0021] In an embodiment modifying sketches comprises deforming sketches as a whole.

[0022] An embodiment further comprises pre-training the neural network using images to recognise a plurality of categories of object.

[0023] An embodiment further comprises fine-tuning the pre-trained model using sketches to recognise a plurality of categories of object.

[0024] In an embodiment the neural network is a three-branch triplet network.

[0025] In an embodiment the training objective is a triplet ranking objective.

[0026] In an embodiment the neural network is trained to perform an auxiliary task.

[0027] In an embodiment the auxiliary task comprises an attribute prediction task and/or an attribute ranking task.

[0028] In an embodiment the training is performed with a plurality of hard triplets selected from the automatically generated triplets.

[0029] Accordingly, the present invention can provide an interactive method of searching using a sketch, e.g. done on a touchscreen device, to input the search query. The present invention can provide fine-grained instance-level retrieval of images, where at each stage of the interactive iteration the input sketch is augmented with more details through directly sketching on the retrieved images, resulting in increasingly fine-grained matches that are closer to the originally intended object.

[0030] By performing instance-level (rather than category level) retrieval, the invention provides a practical user interface for searching, particularly with the wide and increasing availability of touchscreens.

[0031] The invention will be described below with reference to exemplary embodiments and the accompanying drawings, in which:

[0032] Figure 1 is a diagram of an interactive fine-grained sketch-based image retrieval system;

[0033] Figure 2 is a diagram of a selective interactive module;

[0034] Figure 3 is a diagram of an augmenting interactive module;

[0035] Figure 4 is a diagram illustrating a method of training a model;

[0036] Figure 5 is an example of a selective sketch on a retrieved image in a method of the invention; and

[0037] Figure 6 is an example of an augmenting sketch on a retrieved image in a method of the invention. [0038] Figure 7 depicts parts of a shoe to which attributes can be applied;

[0039] Figure 8 depicts examples of photos and corresponding sketches used for training the model of an embodiment of the invention;

[0040] Figure 9 depicts a training network in an embodiment of the invention;

[0041] Figure 10 depicts an example of a query sketch and positive and negative edge -extracted photos;

[0042] Figure 11 depicts examples of original sketches and generated sketches after removing 10%, 30% and 50% of strokes;

[0043] Figure 12 depicts an example process of data augmentation by stroke removal and deformation;

[0044] Figure 13 depicts examples of local deformation of sketches in an embodiment of the invention;

[0045] Figure 14 depicts examples of global deformation of sketches in an embodiment of the invention;

[0046] Figure 15 depicts examples of combined local and global deformation of sketches in an embodiment of the invention; and

[0047] Figure 16 depicts the network architecture of a model according to an embodiment of the invention;

[0048] Figure 17 depicts ranked lists generated automatically and by humans; and

[0049] Figure 18 depicts ranked lists generated by the embodiment of Figure 16 and according to the embodiment of Figure 1.

[0050] The present invention provides methods and systems that can accept user-created sketches as the input query. Sketches are intuitive and descriptive. They are one of the few means for non-experts to create visual content. As a query modality, they offer a more natural way to provide detailed visual cues than pure text. With the proliferation of touch-screen devices, sketch-based image retrieval (SBIR) has gained tremendous application potential.

[0051] Traditional computer vision methods for SBIR mainly focus on category-level retrieval, where intra-category variations are neglected. This is not desirable, since if given a specific shoe sketch (e.g., high-heel, toe-open) as input, it would be pointless to retrieve an image that is indeed a shoe, but with different part semantics (e.g., a flat running shoe). Thus fine-grained SBIR is desirable as a way to go beyond conventional category -level SBIR, and fully exploit the detail that can be conveyed in sketches. By providing a mode of interaction that is more expressive than the ubiquitous browsing of textual categories, fine-grained SBIR according to the invention has potential to provide practical commercial adoption of SBIR technology.

[0052] Fine-grained SBIR is challenging due to: (i) free-hand sketches are highly abstract and iconic, e.g., sketched objects do not accurately depict their real-world image counterparts, (ii) sketches and photos are from inherently heterogeneous domains, e.g., sparse black line drawings with white background versus dense colour pixels, potentially with background clutter, (iii) fine-grained correspondence between sketches and images is difficult to establish especially given the abstract and cross-domain nature of the problem.

[0053] Known proposals for retrieving images or 3d models based on sketches, typically with Bag Of Words (BOW) descriptors or advancements thereof, can be effective and scalable, but are weak at distinguishing fine-grained variations as they do not represent any semantic information. Very recently, approaches to fine-grained SBIR have included DPM -based part modelling in order to retrieve objects in specific poses [Y. Li, T. M. Hospedales, Y.-Z. Song, and S. Gong. Fine-grained sketch-based image retrieval by matching deformable part models. In BMVC, 2014]. However, for practical SBIR in commercial applications, it is desirable to distinguish subtly different object subcategories rather than different poses.

[0054] In a related line of work, fine-grained attributes have recently been used to help drive finegrained image retrieval by identifying subtle yet semantic properties of images (K. Duan, D. Parikh, D. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. In CVPR, 2012.35; A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, pages 192-199, 2014). Moreover, such attributes may provide a route to bridge the sketch/photo modality gap, as they are domain invariant if reliably detected (e.g., a high-heel shoe is 'high-heel' regardless if depicted in a photo or sketch). However, they suffer from being hard to predict due to spurious correlations (D. Jayaraman, F. Sha, and K. Grauman. Decorrelating semantic visual attributes by resisting the urge to share. In CVPR, pages 1629-1636, 2014). An embodiment of the invention brings together attribute and part-centric modelling to decorrelate and better predict attributes, as well as provide two complementary views of the data to enhance matching.

[0055] The present invention can provide a part-aware SBIR framework that addresses the finegrained SBIR challenge by identifying discriminative attributes and parts. Specifically, an off-the- shelf strongly-supervised deformable part-based model (SS-DPM) is first trained to obtain semantic localized regions, followed by low-level feature (e.g. a Histogram of Oriented Features, abbreviated herein as "HOG") extraction within each part region to train part-level attribute detectors (e.g., using conventional Support Vector Machine classifiers). Part decomposition and part-level attribute detection for each and every sketch and photo can be used as an embodiment of the invention.

[0056] Overall system architecture

The overall system 1 of an embodiment of the invention has three main parts: a fine-grained retrieval engine 2, a selective sketch interactive module 3 and an augmenting sketch interactive module 4, as shown in Figure 1. An interface 5 handles communication with a user.

[0057] The fine-grained retrieval engine 2 comprises a ranking model, e.g. a deep ranking model, trained from a database of sketches and photos. It can then be used non -interactively to retrieve photos similar to an input sketch, or interactively via one of the two interactive modules 3, 4. [0058] The interactive interface 5 takes the users' feedback, and refines the returned results based on the user's feedback, which can be of a selective type or an augmenting type. The fine-grained retrieval engine 2, selective sketch interactive module 3, and augmenting sketch interactive module 4 are described in more detail below.

[0059] Both interactive modules can be iteratively called upon to further refine the retrieval results with multiple rounds of feedback. The selective sketch interactive module 3 enables the user to add detail to the original sketch in order to select from an image gallery returned from a previous search. The augmenting sketch interactive module 4 enables the user to sketch on an image from the image gallery returned by a previous search in order to retrieve a new set of results. The main differences between the two interactive modules are described below.

[0060] Firstly, selective sketches provided to the selective sketch interactive module 3 do not need to be processed through the sketch-based retrieval engine; instead, refined retrieval is achieved solely in the image gallery via a separate part-level image retrieval engine.

[0061] Secondly, the augmenting sketch interactive module 4 outputs two sets of data: the actual sketches the user draws and the particular image the user sketched on. The actual sketch, once combined with the original sketch is provided to the sketch retrieval engine to produce a ranking list; and the selected image is provided to a separate image-level retrieval engine to produce another ranking list. The two ranking lists are then merged to generate a final rank list. [0062] Fine-grained ranking model

The training pipeline of the ranking model is shown in Figure 4. Both photo and sketch training data go through a part decomposition and part-level attribute detection component described further below. All sketch training data goes through the data augmentation component that generates more training data to learn from and is described further below.

[0063] Triplet training data is generated using part-level attributes from the part decomposition and part-level attribute detection component and image features. Both augmented sketch data and triplet training data are fed to a triplet ranking network to train it. There are three key points that contribute to solving the problem of training the model to provide improved results. Each provides advantages individually and they synergistically combine to provide greatly improved ranking accuracy.

[0064] Firstly, the invention employs sketch-specific data augmentation to solve the problem of sketch data scarcity - far fewer sketches than photos are available. Data augmentation of sketches can be temporal, spatial or both.

[0065] Secondly, to provide fine-grained/intra-category retrieval, the present invention provides cross-domain triplet ranking, part decomposition and part-level attribute detection and automates triplet annotation using attributes.

[0066] Thirdly, the interactive sketching framework described above allows the user to refine the search results in an iterative process which can involve two types of interactive sketches. Selective sketches indicate parts of interest on retrieved images, e.g., scribbling around a particular decoration on a pair of shoes the user particularly liked (see shoe 6 in Figure 5). Augmenting sketches allow the user to sketch details otherwise not in the images, e.g., sketching a higher heel on top of a retrieved shoe to indicate the desire for a shoe like one sketched on but with a higher heel (see shoe 2 in Figure 6).

[0067] The present invention has been applied, by way of an example, with a fine grained shoe SBIR dataset with images and free-hand human sketches of shoes and chairs. Each image has three sketches corresponding to various drawing styles. This dataset provides a solid basis for learning tasks. The images in the dataset cover most subcategories of shoes commonly encountered in day life. The shoes themselves are unique enough and provide enough visual cues to be differentiated from others. The sketches are drawn by non-experts using their fingers on a touch screen, which resembles the real- world situations when sketches are practically used.

[0068] The shoes in the dataset are tagged with a list of fine-grained attributes for shoes, including words most frequently used to describe a shoe, such as "front platform", "sandal style round", "running shoe", "clogs", "high heel", "great", "feminine" and "appeal". Also included are

functionality descriptions (e.g., sporty) or pure aesthetics (e.g., shiny). A dataset used with an embodiment uses 13 fine-grained shoe attributes, which can be clustered to one of the four parts of a shoe they are semantically attached to, as shown in Fig. 7 [0069] Selective sketch interactive module

The function of the selective module is to make a refined retrieval based on a user's preference for a particular region of a particular retrieved image. The user draws on a chosen retrieved image to indicate that s/he likes the particular highlighted part of that particular image. (E.g., the style of a shoe's heel or toe in a particular retrieved image. Or the style of a chair's back). The ranked list is updated to move examples with parts like the selected one moved higher up the list. A system diagram representing a single loop of user interaction is given in Figure 2. In Figure 2, data nodes are illustrated by ellipses and steps effected by technical components are illustrated by rectangles.

[0070] The input sketch query Dl is applied to the fine-grained ranking model CI which has been trained by a triplet-ranking network. The fine-grained ranking model CI computes the similarity of the input sketch to every photo in an image library and outputs a ranked list of images D2. The user then provides user feedback C2 of selective type by sketching on a specific part of an image. For example, the user draws a circle on an accessory of a particular pairs of shoes he/she likes.

[0071] The part selected by the selective sketch is segmented into a segment image D3. The corresponding segments from the entire image library are compared with the segment image D3. The process to establish part-level correspondences between image-image, and sketch-image is described below. [0072] Using a strongly-supervised deformable part-based model SS-DPM (discussed further below), the part from the selected image is matched C3 to the corresponding part in each other image in the dataset. Any suitable image-domain matching method (e.g., nearest neighbour based on HOG feature) can be used to match between the select image part and the corresponding part of all gallery images. A new ranking list D4 is generated by the image-domain matching method (e.g., sorted by nearest neighbour distances).

[0073] The original rank list D2 and the new rank list from the part selection D4 are fused C4 them to compute a final ranked list that reflects both similarity to the user's initial sketch Dl and selected part D3. The list fusion can be done with any existing fusion method, for example by averaging the distances produced by the two methods C2 and C3. The fusion C4 generates the final rank list D5. Optionally, the process can be iterated. The user can go back to step C2, giving another selective feedback, thus updating the final ranked list.

[0074] Augmenting sketch interactive module:

The interactive module combines an augmenting sketch, drawn on a retrieved image, with the original sketch to generate a new more fine-grained sketch, which gets fed back to the sketch-based finegrained ranking model. It is a way for the user to say "I like this image, but with this <sketched> additional fine-grained detail it's missing". A separate image-level ranking model is also used to produce another rank list with the image the user sketched on as input. The two rank lists, one from the sketch-based retrieval engine, the other from the image-based retrieval engine are merged to produce the final ranked list. A system diagram representing a single loop of user interaction is given in the Fig. 3 and explained below.

[0075] The input sketch query Dll is input to the fine-grained ranking model Cll to obtain a rank list of photos D12. The rank list of photos D12 is updated in each loop. The user provides feedback C12 of augmenting type. This means that the user adds some detail on part of the image he or she prefers. For instance, the user adds a high heel on the boot image in the initial retrieval result. This provides two pieces of information: the augmenting sketch D13, representing the detailed part, and the image D14 which the user sketches on, indicating that the user likes this style.

[0076] With the help of the strongly-supervised deformable part-based model SS-DPM it is known which semantic part the augmenting sketch belongs to on the preferred image. Therefore in a part- based stitching step C13 the corresponding part of the original sketch Dll is replaced by the augmenting sketch D13, to generate a new sketch D15 with the augmenting details. This can be achieved because part-level correspondences across all the images and sketches using can be established using the strongly-supervised deformable part-based model. The ranked list D12 is now updated based on the updated query sketch D15. [0077] In addition, image-domain matching C14 (e.g., Nearest Neighbour (NN) with HOG) is used to rank the image gallery according to similarity with the image D14 the user sketched on thereby generating an image-similarity ranked list D16.

[0078] After the new sketch is fed into the sketch-based retrieval module (CI), the rank list of images returned by the sketch-based retrieval engine is updated (D2 updated). The new rank list and the one generated from the NN method are then fused to generate the final rank list D17. If further feedback is received, this rank list could be updated until the loop is converged.

[0079] Collecting Photo Images

The photo images used cover the variability of the corresponding object category. When collecting the shoe photo images, 419 representative images were selected from UT-Zap50K (A. Yu and K.

Grauman. Fine-Grained Visual Comparisons with Local Learning. In CVPR, 2014) covering shoes of different types including boots, high-heels, ballerinas, formal and informal shoes. For chairs, three online shopping websites, including IKEA, Amazon and Taobao, were searched and chair product images of varying types and styles selected. The final selection consists of 297 images which are representative and cover different kinds of chairs including office chairs, couches, children's chairs, desk chairs, etc..

[0080] Annotation

Since the ultimate goal of the invention is to find the most similar photos to a query sketch, ranking annotation of the training data is desirable. The photo-sketch pair correspondence already provides some annotation that could be used to train a pairwise verification model. However, for fine-grained analysis it is possible to learn a stronger model using a detailed ranking of the similarity of each candidate image to a given query sketch. However, asking a human annotator to ranking all 419 shoe photos given a query shoe sketch would be an error-prone task. This is because humans are bad at list ranking, but better at individual forced choice judgements. Therefore, instead of global ranking, a much more manageable triplet ranking task is used in an embodiment. Specifically, each triplet consists of one query sketch and two candidate photos; the task is to determine which one of the two candidate photos is more similar to the query sketch. Exhaustively annotating all possible triplets (Q(N^A3)) is also out of the question due to the extremely large number of possible triplets. However the inventors have found it to be sufficient to use only a selected subset of the triplets obtain the annotations through the following three steps:

[0081] 1. Attribute Annotation: first an ontology of attributes for shoes and chairs is defined based on existing UT-Zap50K attributes and product tags on online shopping websites. 21 and 15 binary attributes for shoes and chairs respectively were selected and all 1,432 images were annotated with ground-truth attribute vectors. [0082] 2. Generating Candidate Photos for each Sketch: Next, most-similar candidate images, e.g. 10, are selected for each sketch in order to make best use of a limited amount of gold-standard fine-grained annotation effort. In particular, each image was represented by its annotated attribute vector, concatenated with a data driven representation obtained by feeding the image into an existing well-trained deep neural network, such as the recognition network Sketch-a-Net [Sketch-a-Net]

(BMVC' 15), and extracting the FC7 layer activation of the recognition network [Sketch-a-Net] . With this representation, the Euclidean distance between each sketch and image was computed. The coarse rankings obtained from this distance matrix were taken as annotations, except the top 10 most similar examples to each sketch. These more subtle examples were annotated by humans to obtain ground- truth.

[0083] 3. Triplet Annotation: To provide annotations for the triplets selected for manual annotation by the previous step, volunteers were recruited. Each volunteer was presented with one sketch and two photos at a time and asked to indicate which image is more similar to the sketch. Each sketch has 10 · 9/2 = 45 triplets and three people annotated each triplet. The annotations were merged by majority voting to clean up some human errors. These collected triplet ranking annotations can be used in the model of the invention and provide the ground truth for performance evaluation.

[0084] To recap, the problem addressed by the present invention is, for a given query sketch s and a

(set of M candidate photos { }^ 6 P, to compute the similarity between S and p and use it to rank the set of candidate photos so that the true match for the query sketch is ranked at the top. This involves two challenges: (i) bridging the domain gap between sketches and photos, and (ii) capturing subtle differences between candidate photos to obtain a fine-grained ranking despite the domain gap and amateur free-hand sketching. To solve this problem, the present invention provides a deep triplet ranking model learnt using a domain invariant representation _/¾(· ) which enables us to measure the similarity between s and p 6 P for retrieval with Euclidean distance:

D(s, p) = \\fe (s) - fe(p) \\ l (1)

[0085] To learn this representation _/¾(· ) an embodiment of the invention uses the annotated triplets {(Sj, pt⁺, as supervision. A triplet ranking model is thus appropriate. The learning architecture is shown in Figure 9. Specifically, each triplet consists of a query sketch s and two photos p⁺ and p^~, namely a positive photo and a negative photo, such that the positive one is more similar to the query sketch than the negative one. The goal is to learn a feature mapping _/¾(·) that maps photos and sketches to a common feature embedding space, R^d, in which photos similar to particular sketches are closer than those dissimilar ones, i.e., the distance between query s and positive p⁺ is always smaller than the distance between query s and negative p^~:

D(fe (s), fe (p⁺)) < D(f_g(s , f_e (p- ) (2) The embedding is constrained to live on the d-dimensional hypersphere, i.e., |( e( )||² = 1. [0086] A deep triplet ranking model with a ranking loss is formulated. The loss is defined using the max-margin framework. For a given triplet t = (s, p+, p-), its loss is defined as:

e( = max(0, Δ + D(fe(s),fi(p⁺)) - D(fe(s),fi (p ))) (3) where Δ is a margin between the positive-query distance and negative -query distance. If the two photos are ranked correctly with a margin of distance Δ, then this triplet will not be penalised.

Otherwise the loss is a convex approximation of the 0 - 1 ranking loss which measures the degree of violation of the desired ranking order specified by the triplet. Overall we optimise the following objective:

min∑_teT Lg (t) + λϋ(θ) (4) where T is the training set of triplets, Θ are the parameters of the deep model, which defines a mapping _&(^■) from the input space to the embedding space, and R( ) is a regulariser \\ θ \\ . Minimising this loss will narrow the positive-query distance while widening the negative -query distance, and thus learn a representation satisfying the ranking order. With sufficient triplet annotations, the deep model will eventually learn a representation which captures the fine-grained details between sketches and photos for retrieval. Even though the training datasets described above contain thousands of triplet annotations each, they are still far from sufficient to train a deep triplet ranking model with millions of parameters. Therefore, the characteristics of the model of the invention, from architecture design, staged model pre-training to sketch-specific data augmentation, are all designed to cope with the sparse training data problem.

[0087] Heterogeneous vs. Siamese Networks

During training, there are three branches in the network of the invention, and each corresponds to one of the atoms in the triplet: query sketch s, positive photo p⁺ and negative photo p^~ (see Fig. 9). The weights of the two photo branches should always be shared, while the weights of the photo branch and the sketch branch can either be shared or not depending on whether a Siamese network or a heterogeneous network is used.

[0088] Conventional wisdom suggests, for cross-domain modelling if the two domains are drastically different, e.g. text and image, a heterogeneous network is the only option; on the other hand, if the domains are close, e.g. both are photos, a Siamese network makes more sense. The present inventors have determined that a network with heterogeneous branches for the two domains is ineffective for the fine-grained SBIR if training data is limited. This is because the training data is extremely sparse; therefore without using identical architectures and parameter tying, the model would over-fit. Therefore an embodiment of the present invention takes a Siamese network approach and has three identical CNNs for its three network branches. This requires computation of edge-maps from the photos in order to be used as suitable input for the CNN. In future, with more example sketch and photo training data the heterogeneous network could be better and could learn from raw pixel values of photos directly. However, experiments demonstrate that with sparse data for training this network, the Siamese approach performs significantly better.

[0089] For testing, features of sketches and photos (edge maps) are extracted using the sketch branch and photo branch respectively. Then for a query sketch, its ranking result is generated by comparing distances with all candidate photos in the feature embedding space.

[0090] Staged Pre-Training and Fine-Tuning

Given the limited amount of training data, and the fine-grained nature of the final target task, training a good deep ranker is extremely challenging. In practice it requires careful organisation of a series of four pre-training/fine-tuning stages which are described below.

[0091] Category Pre-training: The first step is to train the ranking model from scratch to classify a large number, e.g. 1,000, categories using categorised image data with the edge maps. Desirably, the edge maps are extracted from bounding box areas.

[0092] Category Fine-tuning: The pre-trained ranking model is fine-tuned to classify a smaller number of categories using free-hand sketch images, so that it also represents well the free-hand sketch inputs. In this training session, a novel form of data augmentation is used to improve performance. This data augmentation strategy is discussed below. The result is a set of weights for a single branch of the three-branch ranking network architecture that represent well both free-hand sketch data and photo edge-map data.

[0093] Sketch-Photo Ranking Pre-training: The learned network branch thus far has been optimised for category-level recognition. Turning attention to the ultimate goal of fine-grained retrieval, the three-branch triplet network is initialised with three copies of the ranking model from the previous step. However, since fine-grained intra-category data may be extremely limited, auxiliary sketch/photo category-paired data is additionally used to pre-train the ability to rank.

[0094] Auxiliary sketch/photo category-paired data can be obtained from independent sketch and photo datasets by selecting categories which exist in both datasets, and collecting sketches and photos from each. For sketches, outliers can be excluded by selecting the 60% most representative images in each category (measured by their scores of the category-trained ranking model of the invention for that category). Edge extraction is performed on the photos using the same strategy as used for the pre- training. This can produce a large number, many thousands, of sketches and photos, paired at the category-level.

[0095] In order to use this category-level annotated data to pre-train the triplet ranking model, it is necessary to generate triplets. Given a query sketch, for positive photos, just using the same class is insufficient, because of the within-class variability. This can be done by extracting features from all photos and sketches of the same class using the category -trained ranking model, and using the top 20% most similar images as positives. Negative photos can be sampled from three sources: [0096] 1. Easy negatives: Random photos drawn from a different category. These are obviously less similar to every positive pair, which are drawn from the same category.

[0097] 2. Out-of-class hard negatives: photo images drawn from other categories with distances smaller than the above mentioned positive sketch-photo pairs for every query sketch.

[0098] 3. In-class hard negatives: photos drawn from the bottom 20% most similar samples to the probe within the same category. Overall these are drawn in a 3 : 1 : 1 ratio. Some examples of sampled positive and negative photos can be seen in Fig. 10.

[0099] Sketch-Photo Ranking Fine-tuning: The network so far can be used for fine-grained instance -level retrieval directly if there is no annotated data available for the target object category. However, when data is available it is advantageous to further fine-tune the triplet model specifically for the target scenario. For example, the ranking model from Step 3 is finally tuned on the training split of the shoe/chair datasets described above.

[00100] Data Augmentation

It is increasingly clear that CNN performance ceiling in practice is imposed by limits on available data, with additional data improving performance. Therefore, the present inventors propose two novel sketch-specific approaches to data augmentation that can improve performance. These are stroke removal and stroke deformation. Part of the challenge of sketches is the intra-class diversity: different people can draw exactly the same object in so many different ways. This intra-class diversity is largely due to variation in levels of deformation, curvature and length in individual strokes. Programmatically modifying stroke and object geometry to generate more diverse variants of each input sketch can simulate sketches drawn by different people. In particular, each input sketch is desirably deformed both locally and globally. [00101] Stroke Removal: Sketches captured with appropriate software are different to images that capture all pixels at once. They can be seen as a list of strokes that naturally contain order/timing information. Thus it is possible to generate more sketches by selectively removing different strokes. The proposed stroke-removal strategy considers the following intuitions: 1) The importance of strokes is different. Some strokes are broad outlines of an object which are more important than detailed strokes. 2) The longer the stroke is, the more likely it has a higher importance. 3) People tend to draw the outline first and add details in the end.

[00102] These two points are combined to provide Eq. (5) to determine the probability of removing the i-th stroke:

p_{r. =} I _β (αχο -βχΙ) _{5 t z = ∑}. _e (ax0-/?xi) ₍₅₎ where o and / represent stroke sequence order and length respectively, a and β are two weights for these two factors, and Z is a normalisation constant to ensure it to be a discrete probability

distribution. Overall, the shorter and the later a stroke is, the more likely it will be removed. Fig. 11 shows some examples of original sketches and generated sketches after removing 10%, 30% and 50% of strokes. Clearly they capture different levels of abstraction for the same object (category) which are likely to present in hand-free sketches.

[00103] Stroke Deformation: Different styles of sketching can also be captured by stroke deformations, e.g. by using a Moving Least Squares algorithm for stroke deformation. In the same spirit as stroke removal, the deformation degree should be different across strokes. It can be controlled by the length and curvature of stroke so that strokes with shorter length and smaller curvature are probabilistically deformed more.

[00104] Using stroke -removal and stroke-deformation, it is possible to generate many times the original data by synthesising sketches with different proportions, e.g. 10%, 30% and 50%, of strokes removed, and then applying deformations based on the newly generated sketches. Fig. 12 shows an example of such a process of data augmentation.

[00105] Local Deformation: Another approach to data augmentation is local deformation, i.e.

stroke-level variation. To perform local deformation, it is necessary first to select pivot points. In vector graphic sketch data, such as the scalable vector graphics (SVG), each sketch S is represented as a list of strokes S = {s} (i is the ordered stroke index). Each stroke in turn is composed of a set of segments: s = where each segment b_} is a cubic-bezier spline

b( = (1 - po + 3(1 - + 3(1 -

(6)

0 < t≤ 1

and po and p3 are the endpoints of each bezier curve. Choosing the endpoints of each segment po and P3 as the pivot points for z^'-th stroke (squares in Fig. 13), we jitter the pivot points according to:

p := p + e, s. t. e~J\f (0, rl) (7) where the standard deviation of the Gaussian noise is the ratio between the linear distance between endpoints and actual length of the stroke. This means that strokes with shorter length and smaller curvature are probabilistically deformed more, while long and curly strokes are deformed less. After getting the new position of pivot points (circles in Fig. 13), we then employ the Moving Least Squares (MLS) algorithm to get new position of all points along the stroke. In Fig. 13, the dot-chain line indicates the distorted stroke while the dashed line is the original one. Fig. 13 shows several example sketches with local deformation.

[00106] Global Deformation Alternatively or in addition to locally deforming individual strokes, it is also possible to globally deform the sketch as a whole. First a convex hull algorithm is applied to find the outline shape of the sketch (dashed outline in Fig. 14), and the vertices of the convex polygon whose x/y coordinate is the smallest/largest are used as the pivot points. As with local deformation, Eq. 7 is used to get their new positions and MLS used to compute new position of all points in the sketch. As shown in Fig. 14, squares indicate the pivot points for global deformation and circles are pivot points after translation. Through comparing dashed convex polygons, we can see the effect of global deformation. Fig. 14 displays some sketches with global deformation. [00107] The combination of these two kinds of deformation together, first applying local deformation and followed by global deformation is shown in Figure 15. Experiments show that both deformation strategies contribute to the final recognition performance of the trained model. Variations

[00108] Pre-processing in an embodiment of the invention, both sketches and photos are subjected to pre-processing to alleviate misalignment due to scale, aspect ratio, and centering. The heights of the bounding boxes for both sketches and images are downscaled to a fixed value of pixels while retaining their original aspect ratios. Then the downscaled sketches and images are located to the centre of a blank canvas with the rest padded by background pixels.

[00109] Architecture of the CNN: The architecture of a convolutional neural network that can be used in an embodiment of the invention comprises: five convolutional layers, each with rectifier (ReLU) units, with the first, second and fifth layers followed by max pooling (Maxpool). The filter size of the sixth convolutional layer (index 14 in Table 1) is 7 ^χ 7, which is the same as the output from previous pooling layer, thus it is precisely a fully -connected layer. Then two more fully connected layers are appended. Dropout regularisation is applied on the first two fully connected layers.

[00110] The final layer has 250 output units corresponding to 250 categories (the number of unique classes in the TU-Berlin sketch dataset), upon which we place a softmax loss. The details of an example CNN are summarised in Table 1. Note that for simplicity of presentation, fully connected layers explicitly distinguished from their convolutional equivalents.

[00111] Table 1

Index Layer Type Filter Size Filter Num Stride Pad Output Size

0 Input 225 x 225

1 LI Conv 15 x 15 64 3 0 71 x 71

2 ReLU 71 x 71

3 Maxpool 3 x 3 2 0 35 x 35

4 L2 Conv 5 x 5 128 1 0 35 x 35

5 ReLU 35 x 35

6 Maxpool 3 x 3 2 0 15 x 15

7 L3 Conv 256 1 1 15 x 15

8 ReLU 15 x 15

9 L4 Conv 3 x 3 256 1 1 15 x 15

10 ReLU 15 x 15

11 L5 Conv 3 x 3 256 1 1 15 x 15

12 ReLU 15 x 15 13 Maxpool 3 x 3 2 0 7 x 7

14 L6 Conv(=FC) 7 X 7 512 1 0 1 x 1

15 ReLU 1 x 1

16 Dropout (0.50) 1 x 1

17 L7 Conv(=FC) 512 1 0 1 x 1

18 ReLU 1 x 1

19 Dropout (0.50) 1 x 1

20 L8 Conv(=FC) 250 1 0 1 x 1

[00112] A commonality between the above CNN and some known convolutional neural networks for photograph matching is that the number of filters increases with depth. Specifically the first layer is set to 64, and this is doubled after every pooling layer (indices: 3→ 4, 6→ 7 and 13→ 14) until 512. Also, the stride of convolutional layers after the first is set to one. This keeps as much information as possible. Furthermore, zero-padding is used only in L3-5 (indices 7, 9 and 1 1). This is to ensure that the output size is an integer number.

[00113] On the other hand, differences between the above CNN and other known convolutional neural networks include larger first layer filters. The size of filters in the first convolutional layer might be the most sensitive parameter, as all subsequent processing depends on the first layer output. While classic networks use large 1 1 x 1 1 filters, the current trend of research is moving towards ever smaller filters: very recent state of the art networks have attributed their success in large part to use of tiny 3 x3 filters. In contrast, larger filters are more appropriate for sketch modelling. This is because sketches lack texture information, e.g., a small round- shaped patch can be recognised as eye or button in a photo based on texture, but this is infeasible for sketches. Larger filters thus help to capture more structured context rather than textured information. To this end, a filter size of 15 ^χ 15 is used.

[00114] CNNs used in embodiments of the invention may also differ from conventional neural networks in lacking Local Response Normalisation: Local Response Normalisation (LRN) implements a form of lateral inhibition, which is found in real neurons. This is used pervasively in contemporary CNN recognition architectures (Krizhevsky et al, 2012; Chatfield et al, 2014; Simonyan and Zisserman, 2015). However, in practice LRN's benefit is due to providing "brightness normalisation". This is not necessary in sketches since brightness is not an issue in line-drawings. Thus removing LRN layers makes learning faster without sacrificing performance.

[00115] CNNs used in embodiments of the present invention may also have a larger Pooling Size. Many recent CNNs use 2 x 2 max pooling with stride 2. This approach efficiently reduces the size of the layer by 75% while bringing some spatial invariance. However, a CNN used in an embodiment of the present invention may use a 3 ^χ 3 pooling size with stride 2, thus generating overlapping pooling areas. This can provide ~ 1% improvement without much additional computation. [00116] Deep Multi-Task Embodiment

Another embodiment of the present invention provides a fine grained SBIR model that exploits semantic attributes and deep feature learning in a complementary way. Specifically it performs multitask deep learning with three objectives, including: retrieval by fine-grained ranking on a learned representation, attribute prediction, and attribute-level ranking. Simultaneously predicting semantic attributes and using such predictions in the ranking procedure help retrieval results to be more semantically relevant. Importantly, the introduction of semantic attribute learning in the model allows for the elimination of the cost of human annotations required for training a fine-grained deep ranking model. Experimental results demonstrate that this embodiment outperforms the state-of-the-art on challenging fine-grained SBIR benchmarks while requiring considerably less annotation.

[00117] This embodiment takes advantage of a DNN's strength as a representation learner, but also combines this with semantic attribute learning, resulting in a deep multi-task attribute-based ranking model for FG-SBIR. In particular, this embodiment includes a multi-task DNN model, where the main task is a retrieval task with triplet-ranking objective as described above, and attributes are detected and exploited in two side tasks, which are also referred to herein as auxiliary tasks. The first side task is to predict the attributes of the input sketch and photo images. By optimising this task at training-time, it is encouraged that the learned representation more meaningfully encodes the semantic properties of the photo/sketch. The second side-task is to perform retrieval ranking based on the attribute predictions themselves. At test-time, this means that the retrieval ordering is explicitly driven by semantic attribute-level similarity as well as the similarity of the internally learned representation. An embodiment of the invention may have only one auxiliary task rather than two as described below. An embodiment of the invention may have more than two auxiliary tasks, such as prediction of other attributes such as material, style, product price and/or brand.

[00118] When images are retrieved, predicted attributes of the retrieved images can be displayed to the user. Retrieved images can be sorted and/or filtered by one or more predicted attributes. A multistage search can be performed by receiving a user selection of attributes from the first search results and then performing a second sketch-based search within images having the user-selected attributes.

[00119] This novel deep multi-task attribute-based ranking network architecture has a number of advantages over existing methods:

(1) The unique domain-invariant nature of visual attributes helps to bridge the cross-domain gap between photos and sketches.

(2) By introducing multiple tasks in the network, the model generalises better and further can rely less on expensive human ranking annotation.

Specifically we show that the non-scalable step of triplet annotation required by the embodiment of Figure 1 can now be avoided and an automatic attribute -based strategy is developed instead to focus on the most informative 'hard' training samples for more efficient learning of the model. [00120] This embodiment provides two additional features:

(1) A novel deep (multi-task learning MTL) model to exploit two attribute-based auxiliary tasks for learning semantically meaningful and domain-invariant representation for FG-SBIR.

(2) A new attribute -based triplet generation and sampling strategy is developed to boost the effectiveness of the deep MTL model.

Extensive experiments on common benchmarks demonstrate that this embodiment significantly outperforms the state of-the-art while simultaneously requiring less costly annotation.

[00121] In this section we describe our multi-task deep neural network for fine-grained SBIR. The DNN architecture is illustrated in Figure 16.

[00122] The proposed network is a three branch network. Each input tuple consists of three images corresponding to the query sketch (gone through the middle branch), positive photo image (top branch) and negative photo image (bottom branch) respectively. The positive photo has been annotated as more visually similar to the query than the negative photo. The learned deep model aims to enforce this ranking in the model output. As shown in Figure 16 the architecture of the task-shared part consists of five convolution layers with max pooling as well as a fully -connected (FC) layer, to learn a better representation of original data via feature maps. After these shared layers, different tasks evolve along separate branches: in the main task, one more FC layer with dropout and rectified linear unit (RELU) are added to represent the learned fine-grained feature vectors.

[00123] Similarly, in the auxiliary task, a FC layer (with dropout and RELU) extracts fine-grained attribute representations followed by a score layer to make predictions. Next the three tasks and their uniquely associated layers are described in detail.

[00124] Main Triplet Ranking Task: the main task is sketch-photo ranking, and in this respect the network of this embodiment is similar to the embodiment of Figure 1 , except for the additional dropout to reduce overfitting. The main task is trained by supervision in the form of triplet tuples, with each instance tuple ^ ^! containing an anchor sketch s, positive photo p⁺ and negative photo p^~. Corresponding to these input elements, the network has three branches and the goal is to learn a representation, such that the positive photo p⁺ is ranked above the negative photo p^~ in terms of its similarity to the query sketch s. To this end, the main task loss function is triplet ranking loss:

L (s. p⁺ _}p^~) = max (θ,Δ+ Ζ) (f_& (*) ,/ø (/>+) ) - D (fy (s) , fe (p^~)) )

(⁸) where q represents the parameters of DNN, ^ ^^') denotes the learned deep feature of the

rw .

corresponding network branch, ^ ' V denotes the squared Euclidean distance, and D is the required margin of ranking for the hinge loss.

[00125] Attribute Prediction Task: In order to encourage the learned network representation to encode semantically salient properties of objects (and thus help the main task to make better

(dis)similarity judgements for ranking), we also require the network to predict semantic attributes - such as whether a shoe is high-heeled, or whether a chair has arm-rests. For this task it is assumed that each training sketch s (or photo p) is annotated with N different semantic attributes, thus providing training tuples . ¹ . . . Prediction an attribute vector of a sketch/photo image is a multi-label classification problem because attributes are not mutually exclusive. For convenience, it is assumed that each attribute is binary, although this is not a necessary limitation of the present invention. In this case the attribute prediction loss is the cross-entropy between the attribute labels and predictions

" ^■', so for sketch attribute prediction we have

and similarly the loss functions for the positive and negative photos are obtained by replacing s with p⁺ and p^~ respectively. This attribute prediction task can then be trained simultaneously with the main sketch-photo ranking task

[00126] Attribute Ranking Task: The attribute-prediction task above ensures that the learned representation of the network encodes semantically salient features that support attribute prediction. Since retrieval ranking is the main task, the attribute predication would not be used during test-time. This task's effect on the main task is thus implicit rather than direct. However, as a semantic representation, attributes are domain invariant and thus intrinsically useful for matching a photo with a query sketch. To this end, a third task of attribute-level sketch-photo matching, which matches based on the predicted attributes of sketch and photo input rather than on an internally generated

representation, is introduced.

[00127] The loss function used for this task deserves some thought. A straightforward choice would be treating the attribute prediction exactly the same way as the learned deep representations from the bottom five feature extraction layers of the network and use a loss that is similar to that in Eq. (8), i.e., a triplet ranking loss. Specifically since the attribute predictions are probabilities, attribute predictions from the three branches are compared with cross-entropy rather than squared Euclidean distance as in the main task:

F⁺,P-} -™ { +E{ff {_S) %* (p⁺)) -M ( is) ^ (p-))}

where Η(·) is the cross-entropy between the attribute prediction vectors of the corresponding branches.

[00128] However, there is a subtle but critical difference between the learned deep feature representation and attribute predictions: they have very different dimensionalities - the attributes are in the order of tens whilst the deep features are thousands. This means that they have different levels of discriminative power and thus need to be treated differently when designing cross-domain matching losses. In particular, given a dozen attributes, many similar photo images could have very similar or even identical sets of attributes; forcing them to be different in order to enforce the ranking as in Eq. (10) would be too strong a constraint that is difficult to meet. Taking this into consideration, a more relaxed attribute-similarity loss functi is adopted instead:

which encodes a weaker constraint that the positive photo should have similar attributes to the anchor sketch, and is found to be empirically better than the full triplet ranking loss in experiments. This attribute similarity loss obviously has an effect on how the training tuples are selected, i.e. the sampling strategy, which will be discussed below.

[00129] The overall loss function for multi-task training for an embodiment of the invention having three tasks is given by the weighed sum:

L (s_tp⁺,p^~) (s_tp⁺,p^~) + aL_a (s,p⁺,p^~) + »· (s ^x) + l_{p +} L_p (p⁺,t^{p+ i}

(12) where the first term is the main ranking task, the second term is the attribute ranking task, the next three are attribute predictions for each network branch, and the last one is a regularization term to suppress the complexity of weights. Here the relative weight of each side task is denoted by the hyper parameters ⁼ (*^«> ' )'

[00130] Multi-Task Testing: At run-time the main and attribute -ranking tasks are used together to generate an overall similarity score for a given sketch/photo pair. All sketch/photo pairs are ranked, and the retrieval for a given sketch is the similarity-sorted list of photos. Specifically for a given query sketch s the similarity to each image p in the gallery set is calculated as

Rs (s, p) = D (fy (s) ,/« &>)) + W C (i) _r (p)) .

·^{. .} . ( 13) where D( ) and Η(·) are squared Euclidean distance and cross-entropy respectively.

[00131] Staged Model Pre-training

A staged pre-training strategy is adopted similar to that of the embodiment of Figure 1. Specifically first a single branch classification model with the same feature extraction layers as the proposed full model is pre-trained to first classify ImageNetlK data (encoded as edge maps). This model is very similar to the Sketch-a-Net model designed for sketch classification. This is followed by fine tuning on the 250 classes TU-Berlin sketch recognition task. After that, this single branch network is extended to form a three-branch Siamese triplet ranking network. Each branch is initialised as the pre-trained single-branch model, and the model is then fine tuned on a category-level photo-sketch dataset re- purposed for fine-grained SBIR as described above. After these three stages of pre-training, the full model with two added side-tasks and the overall loss in Eq. (12) is then initialised and fine tuned with the fine-grained SBIR dataset for within -category sketch-based photo retrieval.

[00132] Attribute-based Sampling Strategy Determining an optimal sampling strategy for constructing the anchor-positive-negative triplet tuples for model training is critical. There two major choices: (1) how to generate the triplets and (2) how to select a subset of them for model training. For the former, one straightforward choice is that given each anchor/query sketch, to form exhaustive photo pairs and present the resultant triplets for humans to annotate which photo is more similar to the anchor. However, this is expensive even for a moderate data size. Hence in the first embodiment the top-10 ranked photos for a given anchor is selected, where exhaustive human annotation is collected, yielding a total of 10-9/2=45 triplets per sketch. All of this superset of 45 human annotated triplets are then used to train a triplet ranking model.

[00133] However, there are two problems: (1) even with pre-screening, the exhaustive annotation is still expensive, and (2) the collected annotations are error-prone, since top ranked photos are all very similar to each other, making triplet ranking a challenging task for humans to perform reliably (see Figure 17 - some pairs in the list are hard to order by similarity with respect to the query). The reliability of human annotation can be improved by employing a global ranking method to correct annotation noise.

[00134] However, there is no solution to the scalability issue. In this embodiment, a new way to generate the triplets and a novel sampling strategy are developed, which entirely removes the need for the otherwise non-scalable and unreliable human triplet annotations

[00135] Triplet Generation: Instead of choosing top-10 most similar photos and asking humans to annotate, this embodiment automatically generate triplets based on a strict top-10 ranking induced by attribute and feature similarity. More specially, attribute similarity is used first to construct a top- 10 candidate list of most similar photos given a query sketch. ImageNetCNN features are then used to further rank these photos by similarity with respect to the ground-truth match. Intuitively this strategy can be seen as using semantic attribute properties to generate a meaningful short list, but otherwise driving the cross-domain ranking objective by more subtle photo-photo similarity encoded by a well- trained ImageNet CNN. It follows that a total of, e.g., 45 triplets can be automatically generated by enforcing ranks among candidate photos within each triplet (i.e., photo with higher rank is annotated as positive and vice versa). In Figure 17, the automatic top-10 ranking is compared with a globally optimised ranking computed from human triplets. Overall the automatic one is of comparable (or better) quality than the more costly manually generated list.

[00136] Triplet Sampling: A further novel feature of this embodiment is that instead of using all triplets, a plurality, e.g. 9, of the hardest ones are selected for model training, each consisting of the anchor and two photos of neighbouring ranks (e.g., anchor-Rl-R2 or anchor-R4-R5). It can be shown empirically that this choice of learning curriculum significantly boosts model performance compared to alternatives ranging from exhaustive sampling, easy, and medium. Seemingly counter-intuitive to the conventional 'more data is better' maxim, there are two explanations of why sampling a small subset of hard samples helps: Firstly after extensive (three) stages of model pre-training, the model has already learned a strong domain-invariant representation; it is therefore 'ready' to accept hard training samples. Secondly and importantly, the introduction of the two additional attribute -based side tasks means that the model is much more robust against overfitting with small training data size.

[00137] Experiments

Datasets and Settings

Training and Evaluation Data: We use the same shoe and chair FG-SBIR datasets described above. For training, 304 sketch-photo pairs of shoes, and 200 pairs of chairs were used. Each sketch/photo comes with attribute annotations, which are used to obtain the top 10 photo rank list and additionally to learn attribute-based tasks in the multi-task model. Data augmentation like flipping and cropping is applied.

[00138] Network Implementation: We use the Caffe library [Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACMMM, 2014.] to implement our deep multi-task model. Task-importance parameters are set to

λ ~* { *Λ» - ) " i Ι,Λ& ΜΜ ,ΟΛΙ), i _{e 5} the main and attribute -level ranking tasks have equivalent weight, and the attribute- prediction tasks all have the same lower weights. The single loss margin was set to D=l . During joint training, the batch size is 128, and the network was trained with a maximum of 25000 iterations. The base learning rate was 0.001 and weight decay (lq) was set to 0.0005.

[00139] Evaluation metrics: To evaluate performance, the same two evaluation metrics as used above were used: Top-K retrieval accuracy for K=l and K=10. This corresponds to the use scenario where there is a particular object that the user needs to retrieve exactly. An alternative scenario is where the user just wants to see similar items to the sketch, and in this case the overall ordering is the salient metric. For this the percentage % of correctly ranked triplets is used, which reflects how well the predicted triplet ranking agrees with that of humans.

[00140] Baselines: The multi-task model is compared with several baselines, including the Triplet model described above. As representatives of the classic approaches, RankSVM is trained based on HOG features extracted and encoded as either bag of words (BoW-HOG+rankSVM), or large dense vectors (Dense-HOG+rankSVM). As representatives of alternative deep feature-based approaches, Sketch-A-Net deep features, and 3D shape deep features for RankSVM training

(3DSDeep+RankSVMand ISNDeep+RankSVMrespectively) are also extracted.

[00141] Results

Comparisons against the state-of-the-art: FG-SBIR retrieval performance is first evaluated to compare our multitask model with the state-of-the-art methods outlined previously. From the results in Table 2 below it can be seen we see that our MTL obtains much higher accuracy compared to previous work, especially for Rank-1 matching accuracy - around 10% improvements over the triplet model are achieved, despite the fact that the triplet model requires costly human triplet annotations not necessary with this embodiment.

Table 2: Comparative results against state of the art retrieval performance.

[00142] Contributions of Auxiliary Tasks The main reason the MTL model outperforms the triplet model is due to the benefit provided by the auxiliary attribute related side tasks: indirectly in the case of attribute prediction (AP) and directly in the case of attribute ranking (AR). To demonstrate this the performance of the full model is compared with the performance obtained by removing one or both of the auxiliary tasks (e.g., "Ours -AP" means the full model with the AP task removed). From the results in Table 3 below, it can be seen that each task helps, as performance drops when either is removed, and drops further when both are removed.

Table 3: Contribution of the proposed attribute side tasks.

Shoe Dataset top 1 top trip- Chair Dataset top 1 top 10 trip-acc

10 acc

Ours - AP - AR 37.39% 82.6 66.57 Ours - AP - AR 50.5 91.75% 69.62%

1% % 2%

Ours AR 45.22% 87.8 72.37 Ours AR 72.1 98.97% 72.00%

3% % 6%

Ours AP 44.35% 86.9 71.34 Ours AP 72.1 98.97% 72.10%

6% % 6% Ours 50.43% 91.3 70.59 Ours 78.3 98.97% 71.13%

0% % 5

[00143] Comparison of Triplet Generation and Sampling Strategies: Two ways of generating triplets and various sampling strategies are described. Generation: the triplets are generated either automatically (using attribute/feature ranking) or manually by humans.

[00144] As mentioned earlier, the original human annotation can be noisy, thus human annotations are cleaned by inferring a globally optimised rank list from the annotated pairs using the generalised Bradley-Terry model [Francois Caron and Arnaud Doucet. Efficient bayesian inference for generalized bradley-terry models. Journal of Computational and Graphical Statistics, 21(1): 174-196, 2012]. Sampling: using either generation method, 10 photos are ranked for any given sketch which gives a total of 1.9/2=45 triplets. Sampling options include: (i) Exhaustive: use all 45 triplets with no sampling, or (ii) Hard: sample the 9 hardest triplets as proposed. A network is also trained using the same human annotated triplets used above as baseline

[00145] Table 4 above compares results obtained by our model using different triplet

generation/sampling strategies. We can draw the following conclusions: (1) Our automatically generated hard triplet sampling strategy performs best overall. (2) In general, using a smaller number, e.g. 9, hard triplets performs better than the 45 exhaustive triplets, for either manual or automatic generation. This suggests that hard triplets help learn a better fine-grained cross-domain

representation. (3) Overall, the auto-generated triplets produce better performance than the human annotated triplets. The above results are somewhat surprising, as the conventional wisdom is that 'more data is always better' and that careful manual annotation should be better than automatic annotation. The superiority of fewer harder triplets can be attributed to the fact that the base model is already quite well pre-trained, so that at the point we start training it is 'ready' for difficult examples in a curriculum learning sense; and the superiority of generated triplets to manually annotated triplets to the fact that the similarity judgements are quite hard to make reliably given the short list of similar images, so in this case the human annotation is no more reliable than the automatic annotation.

[00146] Next the issue of sampling triplets according their difficulty level is investigated further. Hard triplets are defined as before where each triplet spans a distance of 1 on the rank list. Medium triplets are defined as those with distance 2 and 3 and easy triplets are those with distance larger than 3. Thus within the top-10 list, the 45 exhaustive triplets include 9 hard, 15 medium and 21 easy ones. The results in Table 5 show that performance increases with triplet difficulty supporting the hypothesis that hard triplets are the most valuable at this stage.

[00147] Qualitative Results: Example retrieval results of the multi-task model of the present invention are shown in Figure 18, where the retrieved image with thicker outline is the ground truth.

[00148] Computational Cost: The deep multi-task model was trained on an Nvidia TeslaK80GPU. The reimplementation of the sketch triplet model takes about 5 days. The joint training of the proposed deep multi-task model takes about 7 hours for 25,000 iterations of batches for either chair or shoe dataset.

[00149] Accordingly the present embodiment provides a deep multi-task attribute -based model for fine-grained SBIR. By constructing attribute-prediction and attribute-based ranking side-tasks alongside the main sketch-based image retrieval task, the main task representation is enhanced by being required to encode semantic attributes of sketches and photos, and moreover the attribute predictions can be exploited to help make similarity predictions at test time. The combined result is that performance is significantly improved compared to models using a deep triplet ranking task alone. Beyond this it is shown that somewhat surprisingly the human subjective triplet annotation is not critical for obtaining good performance. This means that it is relatively easy to extend the method to new categories and larger datasets, since attribute annotation grows only linearly rather than cubically in the amount of data. [00150] Having described embodiments of the present invention, it will be appreciated that such embodiments are illustrative and not limiting of the present invention, which is defined by the appended claims. In particular, the multi-task module described with reference to Figure 16 can be substituted for the triplet model in the embodiment described with reference to Figures 1 to 15.

Claims

1. A method of searching for specific images of a target object, the method comprising:

providing the ranked list of images to the user.

2. A method according to claim 1, wherein the deep triplet ranking model is a convolutional neural network.

3. A method according to claim 2, wherein the convolutional neural network is a Siamese network.

4. A method according to claim 1, 2 or 3 wherein the deep triplet ranking model is a multi-task model.

5. A method according to claim 4 where the multi-task model has been trained to learn an auxiliary task comprising an attribute prediction task and/or an attribute learning task.

6. A method according to claim 4, wherein the ranked list includes predicted attributes for each image and further comprising receiving a user selection of attributes and using the deep triplet ranking model to compare the sketch data to images of the gallery of images that have the user-selected attributes to obtain a second ranked list of images.

7. A method according to any one of the preceding claims, wherein the sketch data includes information representing the order of strokes in the sketch.

8. A method according to any one of the preceding claims further comprising:

identifying the part of the object in the selected image indicated by the additional sketch data; using a strongly-supervised deformable part-based model to compare the part of the object to corresponding parts of the images of the gallery of images to obtain an updated ranked list of images; and

providing the updated ranked list of images to the user.

9. A method according to any one of the preceding claims further comprising:

providing the second and third ranked lists of images to the user.

10. A method according to claim 9 wherein the second and third ranked lists of images are provided to the user as a merged updated ranked list of images.

11. A method of training a neural network to perform fine-grained sketch-based image retrieval, the method comprising:

training the neural network using the plurality of triplets.

12. A method according to claim 11 wherein generating a plurality of triplets includes selecting positive images and/or negative images by extracting features using a category-trained ranking model.

13. A method according to claim 11 or 12 further comprising generating a plurality of additional sketches by modifying sketches of the training sketch gallery.

14. A method according to claim 13 wherein modifying sketches comprising selectively removing strokes forom a sketch.

15. A method according to claim 14 wherein selectively removing strokes comprises randomly removing strokes with a probability based on stroke length and stroke order.

16. A method according to any one of claims 13 to 15 wherein modifying sketches comprises deforming strokes of sketches individually.

17. A method according to any one of claims 13 to 16 wherein modifying sketches comprises deforming sketches as a whole.

18. A method according to any one of claims 11 to 17 further comprising pre -training the neural network using images to recognise a plurality of categories of object.

19. A method according to claim 18 further comprising fine-tuning the pre-trained model using sketches to recognise a plurality of categories of object.

20. A method according to any one of claims 11 to 19 wherein the neural network is a three- branch triplet network.

21. A method according to any one of claims 11 to 20 wherein the training objective is a triplet ranking objective.

22. A method according to any one of claims 11 to 21 wherein the neural network is trained to perform an auxiliary task.

23. A method according to claim 22 wherein the auxiliary task comprises an attribute prediction task and/or an attribute ranking task.

24. A method according to any one of claims 11 to 23 wherein the training is performed with a plurality of hard triplets selected from the generated triplets.