CN113868448A - Fine-grained scene-level sketch-based image retrieval method and system

Info

Publication number: CN113868448A
Application number: CN202111004545.6A
Authority: CN (China)
Prior art keywords: scene, sketch, image, graph, graph structure
Legal status: Pending (the legal status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: Cuixia Ma (马翠霞), Fang Liu (刘舫), Keqi Chen (陈科圻), Xiaoming Deng (邓小明), Hongan Wang (王宏安)
Current/Original Assignee: Institute of Software of CAS
Application filed by Institute of Software of CAS
Publication of CN113868448A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval of still image data
    • G06F 16/53 - Querying
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods


Abstract

The invention discloses a fine-grained scene-level sketch-based image retrieval method and system in the field of computer vision. It addresses the problem that most existing sketch-based image retrieval methods target single-object, category-level retrieval, and expands the application of sketch retrieval in everyday life. An attention mechanism is introduced to build an adaptive graph convolutional neural network, and a triplet network is used to match the sketch with the images to be retrieved. The scene sketch and the images to be retrieved are modeled at three levels of the scene (global layout, category level, and instance level), and the information of each level is integrated to achieve accurate matching between the input scene sketch and the images.

Description

Fine-grained scene level sketch-based image retrieval method and system
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a fine-grained scene-level sketch-based image retrieval method and system.
Background
Visual media applications based on sketch interaction have long been a research hotspot in the fields of human-computer interaction, computer vision, and multimedia, and how to optimize the basic processing of sketch data and improve the efficiency of sketch-based visual media applications is a key research problem. Sketch interaction is widely applied in many aspects of life and work, including drawing, note taking, document annotation, web user interface (UI) and concept design in the Internet industry, and animation and film production in the film and animation industry. In recent years, research and applications related to sketch interaction have attracted wide attention in both industry and academia, one important reason being the explosive development of touch-screen hardware devices (e.g., Microsoft Surface touch notebooks, the Apple Pencil, smart phones, and tablet computers). In the artificial intelligence era, on the one hand, users can acquire sketch data more conveniently; on the other hand, the performance of deep-learning-based sketch data algorithms keeps improving. Applications and tasks based on sketch interaction have accordingly developed at an unprecedented pace. The main sketch interaction tasks and their development trends include:
(1) In terms of sketch interaction tasks and sketch data processing, sketch processing technologies such as sketch recognition, sketch simplification, and sketch colorization have been widely studied. Sketch-a-Net builds a sketch recognition model on a convolutional neural network and achieves strong sketch recognition performance (reference: Yu, Qian, Yongxin Yang, Feng Liu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. "Sketch-a-Net: A deep neural network that beats humans." International Journal of Computer Vision 122, no. 3 (2017): 411-425.). Learning to Simplify can automatically simplify and optimize a user-drawn sketch in a data-driven way, including removing messy redundant strokes and beautifying lines (reference: Simo-Serra, Edgar, Satoshi Iizuka, Kazuma Sasaki, and Hiroshi Ishikawa. "Learning to simplify: fully convolutional networks for rough sketch cleanup." ACM Transactions on Graphics (TOG) 35, no. 4 (2016): 1-11.).
(2) New sketch interaction tasks have also been proposed, such as sketch-based image generation, sketch-based model generation, sketch abstraction based on reinforcement learning, sketch-based image editing, and sketch segmentation based on graph convolutional networks. Liu et al. propose SceneSketcher, an image retrieval method based on scene sketches that can retrieve similar scene images from a user-drawn sketch (reference: Liu, Fang, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, and Hongan Wang. "SceneSketcher: Fine-Grained Image Retrieval with Scene Sketches." In European Conference on Computer Vision, pp. 718-734. Springer, Cham, 2020.).
(3) Driven by the requirements of practical applications, sketch interaction research has moved toward fine-grained methods. In particular, relative to holistic tasks (e.g., sketch recognition), several fine-grained sketch interaction tasks have been proposed in recent years. At present, most sketch-based image retrieval techniques are built on the premise of instance-level and category-level retrieval, namely: the input sketch and the image objects to be retrieved are both single objects, and a retrieval result is considered correct when the object in the result image belongs to the same category as the input sketch object. Traditional instance-level and category-level sketch-based image retrieval methods only focus on retrieving images of the same category, and typically ignore the shape, pose, and other fine-grained attributes of the retrieved images. Compared with category-level sketch-based retrieval, text retrieval can express the same category semantics with a simpler query, which is why traditional sketch-based image retrieval has not been widely adopted in practice.
The high abstraction, intuitiveness, and conciseness peculiar to sketches have made them widely used in human-computer interaction, computer vision, multimedia, computer graphics, and other fields. From the 1960s to the present, with the continuous improvement of data processing technology, sketch-related research and applications have been continuously refined. Among them, sketch-based image retrieval (SBIR) is one of the most widely used and representative sketch applications, and in the intelligent era it faces new developments and challenges.
Disclosure of Invention
The invention aims to provide a fine-grained scene-level sketch-based image retrieval method and system. An attention mechanism is introduced to build an adaptive graph convolutional neural network; the scene sketch and the images to be retrieved are modeled at three levels of the scene (global layout, category level, and instance level), and the information of each level is integrated to achieve accurate matching between the input scene sketch and the images.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a fine-grained scene-level sketch-based image retrieval method comprises the following steps:
1) respectively constructing a graph structure (graph) for the scene sketch and the scene images to be retrieved, wherein each node in the graph structure represents an object category in the scene, and the edges in the graph structure represent the relationships between object categories in the scene;
2) extracting graph structure characteristics of a scene sketch and a scene image according to the constructed graph structure by using an attention-based adaptive graph convolution neural network;
3) in the training stage, processing the training image data through steps 1) and 2) to extract the graph structure features of the scene sketch and the scene image; using a triplet network to compute the Euclidean distances of positive and negative samples from the extracted graph structure features, and calculating the triplet loss from these distances; according to the triplet loss, updating the adaptive graph convolutional neural network through back propagation and optimizing the network parameters to obtain the trained adaptive graph convolutional neural network;
4) in the testing stage, processing the scene sketch and the scene images to be retrieved through steps 1) and 2); extracting the graph structure features of the scene sketch and the scene images using the trained adaptive graph convolutional neural network, and matching the scene sketch against the scene images to be retrieved according to the Euclidean distance between their graph structure features, yielding the image retrieval result (a minimal matching sketch follows below).
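As an illustration of the matching in step 4), the following is a minimal sketch assuming each scene has already been encoded into a feature vector by the trained network; the helper name `retrieve` is hypothetical and not part of the patent:

```python
import torch

def retrieve(sketch_feat, image_feats, top_k=5):
    """Ranks candidate scene images by Euclidean distance to the sketch feature.

    sketch_feat: (C,) graph structure feature of the query scene sketch.
    image_feats: (M, C) graph structure features of M candidate scene images.
    Returns the indices of the top_k closest images.
    """
    dists = torch.cdist(sketch_feat.unsqueeze(0), image_feats).squeeze(0)  # (M,)
    return torch.argsort(dists)[:top_k]
```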
Further, the construction of the graph structure comprises the construction of nodes and the construction of edges, wherein the construction method of the nodes is as follows: clustering the instances in the scene according to object category, and obtaining node features from the clustered instances, the node features comprising the category label, the visual features of the instances, and the position information of the instances.
Further, the step of obtaining the node characteristics comprises:
1) detecting the scene sketch or scene image with the trained object detection network YOLO-v4 to obtain the position and category information of each instance in the scene;
2) extracting the visual features of each instance in each category of the scene with the visual feature extraction network Inception-V3, and concatenating the visual features of each instance with its position information to form instance features;
3) fusing all the instance features within each category through a convolutional neural network to obtain each node feature.
Further, YOLO-v4 is trained on the COCO-Stuff database, and Inception-V3 is trained on the ImageNet dataset.
Further, the position information of an instance is represented by a four-dimensional vector whose values are the coordinates of the upper-left and lower-right corners of the instance's rectangular bounding box.
Further, the visual features are 2048-dimensional vectors, and a 2052-dimensional instance feature vector is obtained by concatenating the visual features of each instance with its position information.
Further, the construction method of the edges is as follows: calculating the Euclidean distance between two nodes, normalizing the Euclidean distance, and constructing an edge with the normalized Euclidean distance as its weight.
Further, three different graph structures are modeled by constructing three different adjacency matrices, comprising the following steps:
1) calculating the Euclidean distances between the instance center positions of each category and normalizing them to obtain the edges A1 of the graph structure;
2) extracting a word vector for each category label via the Word2Vec word embedding algorithm, and computing the cosine distances between the category label word vectors as the edges A2 of the graph structure;
3) introducing a learnable adjacency matrix as the edges A3 of the graph structure and initializing it randomly;
4) obtaining three adjacency matrices from the edges A1, A2, and A3, adding them to obtain the updated adjacency matrix of the graph convolutional neural network, and representing the graph structure with the updated adjacency matrix.
Further, the graph convolution neural network extracts features from the graph structure through an affine function; the graph convolutional neural network has a plurality of network layers, each layer from the second layer taking as input the output of the previous layer and the adjacency matrix of the graph structure.
A fine-grained scene-level sketch-based image retrieval system, comprising:
the target detection network is used for detecting a scene sketch and a scene image to obtain the position and the category information of each instance in the scene;
the visual feature extraction network is used for extracting the visual features of the examples in each category of the scene and connecting the visual features of the examples with the position information to form example features;
the single-layer convolutional neural network is used for fusing all the example characteristics in each category to obtain each node characteristic;
the adaptive graph convolutional neural network is used for extracting the graph structure features of the scene sketch and the scene image according to the constructed graph structures, calculating the Euclidean distance between the two graph structure features, and matching the scene sketch with the scene images to be retrieved;
the triplet network is used, in the training stage, for computing the Euclidean distances of positive and negative samples from the extracted graph structure features and calculating the triplet loss from these distances; according to the triplet loss, back propagation updates the adaptive graph convolutional neural network and optimizes the network parameters.
Compared with the prior art, the invention has the beneficial effects that:
1. the method improves the traditional sketch-based image retrieval task, focuses on fine-grained and scene-level image retrieval, can expand the application context of sketch retrieval, and promotes the application of sketch retrieval in actual life.
2. The invention provides a method for representing scene sketch and images by using a graph structure, explicitly simulating object types in the scene by using nodes of the graph structure, and simulating the spatial relationship and semantic relationship between the types in the scene by using the sides of the graph structure, so that a multi-level scene sketch-image retrieval and matching model of global layout-type level-example level can be established.
3. The attention-based adaptive graph convolution module increases the flexibility of the model in learning graph features and object relationships, and allows the visual feature extraction networks and the graph convolutional neural network module to be trained simultaneously in an end-to-end manner, optimizing the retrieval performance of the model.
Drawings
Fig. 1 is a schematic diagram of a fine-grained scene-level sketch-based image retrieval network structure according to the present invention.
FIG. 2 is a diagram illustrating the structure of a graph node according to the present invention.
FIG. 3 is a diagram of an adaptive graph convolution network module according to the present invention.
Fig. 4 is an example of image retrieval according to the present invention.
FIG. 5 is a comparison of the search results of the present invention and SceneSketcher 1.0.
Detailed Description
The process of the invention is described in further detail below so that those skilled in the art can better understand the invention, but this is not to be construed as limiting the invention.
The embodiment discloses a fine-grained scene-level sketch-based image retrieval method, which mainly comprises: a graph structure construction method for scene sketches and scene images; an adaptive graph convolutional neural network built by introducing an attention mechanism; and the joint use of the graph structure's spatial position information, semantic relations, and a learnable adjacency matrix to update the adaptive graph convolutional neural network.
Fig. 1 is a schematic diagram of the fine-grained scene-level sketch-based image retrieval network structure provided by the invention, which comprises: (1) the inputs, namely a scene sketch, a positive example image matching the scene sketch, and a negative example image not matching the scene sketch; (2) an adaptive graph convolution feature extraction network; (3) a triplet network.
The method mainly comprises the following steps:
1. Scene graph construction process
The graph structure is represented as G = (N, E), where N = {n_i} is the set of nodes of the graph structure, and node n_i represents the object category whose class label in the scene is c_i; E = {e_ij} is the set of edges, and e_ij = (n_i, n_j) is the edge connecting nodes n_i and n_j. The set of categories in the scene is denoted C = {c_i}.
This embodiment makes the nodes represent the object categories of the scene, and uses the visual features and position information of each instance to construct the feature of node n_i in the graph structure, as follows:
1) 16379 images are selected from the COCO-Stuff database (reference: Caesar H, Uijlings J, Ferrari V. COCO-Stuff: Thing and Stuff Classes in Context. 2016.) to train the YOLO-v4 object detection network (reference: Bochkovskiy A, Wang C Y, Liao H. YOLOv4: Optimal Speed and Accuracy of Object Detection. 2020.), and the scene sketch and the image to be retrieved are processed by the detection network to obtain the position and category of each object instance in each scene. Specifically, given an object category c_i, {o_ij} is the set of instances whose category label in the scene is c_i; the position information p_ij of each instance o_ij is obtained through the YOLO-v4 detection network.
2) A visual feature extraction network pre-trained on the ImageNet dataset (Inception-V3, reference: Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 2818-2826.) is used to extract the visual feature v_ij of each instance o_ij; the visual feature is a 2048-dimensional vector.
3) The position information p_ij of instance o_ij is represented as a 4-dimensional vector whose four values are the coordinates of the upper-left and lower-right corners of the instance's rectangular bounding box.
4) The visual feature v_ij and position information p_ij of instance o_ij are concatenated to form a 2052-dimensional instance feature vector.
5) The instance feature vectors of all instances o_ij of category c_i are fused through a single-layer convolutional neural network to obtain the feature vector x_i of graph node n_i, representing the feature information of object category c_i in the scene (see the sketch below).
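The node construction above can be sketched as follows. This is a minimal PyTorch example under stated assumptions: the per-instance visual features and boxes are taken as already computed by the detector and backbone, and the pooling after the single-layer CNN is our guess, since the text does not spell it out:

```python
import torch
import torch.nn as nn

class CategoryNode(nn.Module):
    """Builds the feature x_i of one graph node (object category c_i)."""
    def __init__(self, vis_dim=2048, pos_dim=4):
        super().__init__()
        # Single-layer convolutional network fusing the instance features.
        self.fuse = nn.Conv1d(vis_dim + pos_dim, vis_dim + pos_dim, kernel_size=1)

    def forward(self, vis_feats, boxes):
        # vis_feats: (k, 2048) Inception-V3 features of the k instances o_ij;
        # boxes:     (k, 4) bounding-box corners p_ij from YOLO-v4.
        inst = torch.cat([vis_feats, boxes], dim=1)   # (k, 2052) instance features
        fused = self.fuse(inst.t().unsqueeze(0))      # (1, 2052, k)
        return fused.mean(dim=2).squeeze(0)           # (2052,) node feature x_i
```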
For the construction of the edges in the graph structure, for two nodes n_i and n_j, the weight A_ij of edge e_ij = (n_i, n_j) is defined as

A_ij = 1 - D_ij,

where D_ij = ||x_j - x_i||_2 is the (normalized) Euclidean distance between nodes n_i and n_j.
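A minimal sketch of this edge construction follows; the exact normalization is not specified in the text, so dividing by the maximum pairwise distance is one plausible choice:

```python
import torch

def edge_weights(node_feats):
    """A_ij = 1 - D_ij, with D_ij the normalized Euclidean distance."""
    # node_feats: (n, d) matrix whose rows are the node features x_i.
    dist = torch.cdist(node_feats, node_feats, p=2)   # D_ij = ||x_j - x_i||_2
    dist = dist / (dist.max() + 1e-8)                 # normalize into [0, 1]
    return 1.0 - dist                                 # closer nodes get heavier edges
```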
2. Graph convolution neural network (GCN)
The GCN extracts features from the graph G = (N, E) through an affine function f(·,·). The input of each GCN layer is the output of the previous layer together with the adjacency matrix A = {A_ij} of the graph structure. The propagation of the GCN base network at layer l can be written as

H^(0) = [x_1, x_2, ..., x_n]^T,
H^(l) = f(H^(l-1), A),

where 1 ≤ l ≤ L, L is the number of GCN layers, x_i are the node feature vectors, H^(l) is the output of layer l, n is the total number of nodes, and A is the adjacency matrix.
Further, the invention utilizes the optimized GCN propagation rule (reference: Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).), under which the function f(·,·) can be expressed as

f(H^(l-1), A) = σ( D̂^(-1/2) Â D̂^(-1/2) H^(l-1) W^(l) ),

where σ(·) is the leaky_relu activation function, Â = A + I with I the identity matrix, D̂ is the (diagonal) degree matrix of the nodes of Â, and W^(l) is the weight matrix to be learned.
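This propagation rule can be written as the following minimal PyTorch layer; a sketch assuming a dense adjacency matrix, with names of our choosing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One layer computing sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)   # W^(l)

    def forward(self, H, A):
        # H: (n, in_dim) node features; A: (n, n) adjacency matrix.
        A_hat = A + torch.eye(A.size(0), device=A.device)      # add self-loops (A + I)
        deg = A_hat.sum(dim=1).clamp(min=1e-8)                 # node degrees of A_hat
        d_inv_sqrt = torch.diag(deg.pow(-0.5))                 # D^-1/2
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt               # symmetric normalization
        return F.leaky_relu(self.weight(A_norm @ H))           # sigma = leaky_relu
```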
The GCN is applied to the graph structures of the scene sketch and the scene image respectively, yielding the corresponding graph structure features G_S and G_I.
3. Adaptive Graph Convolutional Neural Network Module (Adaptive Graph Convolutional Module)
The existing sketch retrieval technique SceneSketcher (reference: Liu, Fang, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, and Hongan Wang. "SceneSketcher: Fine-Grained Image Retrieval with Scene Sketches." In European Conference on Computer Vision, pp. 718-734. Springer, Cham, 2020.) uses a single graph structure to model the semantic and spatial relationships between objects, and does not take into account the strong dependencies between object categories in a scene. An example of such a dependency, based on statistics of the image content in the database, is that the probability of "dog" and "cat" occurring together is generally greater than that of "dog" and "airplane" occurring together. The invention therefore proposes an adaptive graph convolutional neural network module for the fine-grained scene-level SBIR task, based on the spatial attention mechanism in 2s-AGCN (reference: Shi L, Zhang Y, Cheng J, et al. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. 2018.). The Inception-V3 network used for instance visual feature extraction during node construction has the same structure as the visual feature extraction network used in the adaptive graph convolutional neural network module, but their parameters are updated separately during model training. Throughout the training of the retrieval framework, the parameters of each visual feature extraction network and of the graph convolutional neural network module are updated simultaneously, constructing an end-to-end retrieval model framework.
The method models three different graph structures by constructing three different adjacency matrices:
1) The fixed adjacency matrix A_1 represents the category-level features and spatial layout of the scene sketch. The invention uses the edges A_ij of the graph structure built in the scene graph construction process as A_1.
2) To better model the semantic associations between object categories, the invention designs a semantic graph to model the topological structure of the category label space. Each category label c_i is encoded as a 300-dimensional word vector w_i by the Word2Vec word embedding algorithm (reference implementation: https://code...), and the cosine distance between each pair of category label word vectors is computed to measure the correlation between categories, serving as the edges A_2 of the graph structure.
3) A_3 is a learnable adjacency matrix. Whereas A_1 and A_2 are fixed after initialization, the adjacency matrix A_3 is continuously updated throughout the training of the network model framework. Through this data-driven learning approach, the model can learn a graph structure particularly suited to a specific retrieval task and retrieval dataset. To keep A_3 flexible, the invention initializes it randomly.
4) The three adjacency matrices are added, integrating the graph structures of the different levels, to serve as the updated adjacency matrix of the graph convolutional neural network (a code sketch follows below).
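A sketch of the three adjacency matrices and their sum, assuming the category centers and 300-dimensional Word2Vec label vectors are precomputed; following the text literally, A_2 is the cosine distance between label vectors, though cosine similarity would be the more common choice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_adjacency(centers):
    """A_1: 1 - normalized Euclidean distance between category centers."""
    d = torch.cdist(centers, centers, p=2)
    return 1.0 - d / (d.max() + 1e-8)

def semantic_adjacency(word_vecs):
    """A_2: cosine distance between 300-d Word2Vec label embeddings."""
    w = F.normalize(word_vecs, dim=1)
    return 1.0 - w @ w.t()

class AdaptiveAdjacency(nn.Module):
    """A = A_1 + A_2 + A_3; A_3 is learned end-to-end, A_1 and A_2 stay fixed."""
    def __init__(self, num_nodes):
        super().__init__()
        # Random initialization keeps A_3 flexible, as described in the text.
        self.A3 = nn.Parameter(torch.randn(num_nodes, num_nodes) * 0.01)

    def forward(self, A1, A2):
        return A1 + A2 + self.A3
```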
Fig. 3 shows the structure of the adaptive graph convolution neural network module of the present invention, and the processing flow mainly includes the following steps:
1) A scene sketch is input; the graph-structure node construction module of the adaptive graph convolutional network first produces an N × (f_v + 5)-dimensional graph structure feature f_in, where N denotes the number of nodes and f_v denotes the output dimensionality of the visual feature extraction network. Experiments show that with f_v = 2048 the proposed network model achieves good results in most cases.
2) Meanwhile, for the input scene sketch, a 1 × N global feature vector is extracted using the visual feature extraction network.
3) The input f_in is then encoded into an N × C graph structure feature matrix by the graph convolutional neural network using the adjacency matrix constructed above.
4) Finally, the 1 × N global feature vector is multiplied by the N × C graph structure feature matrix to obtain a 1 × C scene sketch feature, which is used for the final feature matching and image retrieval between the scene sketch and the image (see the sketch below).
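Step 4) amounts to a single matrix product, sketched below with names of our choosing; the 1 × N global vector acts as attention weights over the N nodes:

```python
import torch

def scene_feature(global_vec, graph_feats):
    """Fuses the global and graph-level features into one retrieval feature.

    global_vec:  (1, N) global feature from the visual extraction network.
    graph_feats: (N, C) graph structure features from the GCN.
    Returns the (1, C) scene sketch feature used for matching.
    """
    return global_vec @ graph_feats
```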
5. Loss function
The invention realizes feature matching and image retrieval between the scene sketch and the image using a triplet network. The rationale of a triplet network is to pull features with the same label closer together and push features with different labels farther apart. The input of the triplet network is (S, I+, I-), where S denotes a scene sketch, I+ is the scene image corresponding to the scene sketch, and I- is an image that does not match the input sketch. The loss function of the triplet network can be expressed as

L_tri = max( d(S, I+) - d(S, I-) + m, 0 ),

where d(·,·) is a distance function of the feature space used to measure distances between graph structure features; the invention uses the Euclidean distance as d(·,·), and m is a margin threshold. In the experiments of the invention, the threshold m is set to 0.4, with which the model obtains stable performance in most settings.

For the input scene sketch feature G_S, the distances to the positive-sample scene image feature G_{I+} and the negative-sample scene image feature G_{I-} are computed; that is, for the input triplet (S, I+, I-), the Euclidean distance is used to obtain d(S, I+) and d(S, I-) respectively.
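A minimal sketch of the triplet loss above; PyTorch's built-in nn.TripletMarginLoss(margin=0.4) computes the same quantity:

```python
import torch
import torch.nn.functional as F

def triplet_loss(g_s, g_pos, g_neg, margin=0.4):
    """L_tri = max(d(S, I+) - d(S, I-) + m, 0), averaged over the batch.

    g_s, g_pos, g_neg: (B, C) features of the sketch S, the matching image I+
    and a non-matching image I-; d is the Euclidean distance, m = 0.4.
    """
    d_pos = F.pairwise_distance(g_s, g_pos, p=2)   # d(S, I+)
    d_neg = F.pairwise_distance(g_s, g_neg, p=2)   # d(S, I-)
    return F.relu(d_pos - d_neg + margin).mean()
```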
6. Experimental part
Experimental data: the invention involves three experimental datasets: (1) based on the SketchyCOCO scene sketch dataset, sketches and images containing more than two objects are selected to construct the Scene Sketch Database, comprising 1225 "sketch-image" matching pairs in total (1015 training pairs and 210 test pairs) covering 14 object categories (bicycle, car, motorcycle, airplane, traffic light, fire hydrant, cat, dog, horse, sheep, cow, etc.); (2) based on the SketchyCOCO scene sketch dataset, 5000 images of similar categories are selected from the COCO-Stuff dataset as a supplement to the test set, forming the Extended Scene Sketch Database, i.e., 1015 training pairs and a test set comprising 210 test sketches and 5210 test images; (3) the "sketch-image" matching pairs whose semantic IoU is greater than 0.5 are removed from the SketchyScene dataset to form the SketchyScene Database, comprising 2724 "sketch-image" matching pairs in total (2472 training pairs and 252 test pairs).
The method of the invention is compared with seven classical sketch-based image retrieval methods; the results are shown in Table 1 below:
TABLE 1: comparison with seven classical sketch-based image retrieval methods (the table is rendered as an image in the original document)
(1) HOG + BoW + RankSVM (reference: R. Hu and J. Collomosse, "A performance evaluation of gradient field HOG descriptor for sketch based image retrieval," Computer Vision and Image Understanding, vol. 117, no. 7, pp. 790-806, 2013.).
(2) Dense HOG + RankSVM (reference: Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy, "Sketch me that shoe," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 799-807.).
(3) Sketch-a-Net + RankSVM (reference: Yu, Qian, Yongxin Yang, Feng Liu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. "Sketch-a-Net: A deep neural network that beats humans." International Journal of Computer Vision 122, no. 3 (2017): 411-425.).
(4) Sketch me that shoe (reference: Yu, Qian, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, and Chen-Change Loy. "Sketch me that shoe." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 799-807. 2016.).
(5) DSSA (reference: Song, Jifei, Qian Yu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. "Deep spatial-semantic attention for fine-grained sketch-based image retrieval." In Proceedings of the IEEE International Conference on Computer Vision, pp. 5551-5560. 2017.).
(6) SketchyScene (reference: Zou, Changqing, Qian Yu, Ruofei Du, Haoran Mo, Yi-Zhe Song, Tao Xiang, Chengying Gao, Baoquan Chen, and Hao Zhang. "SketchyScene: Richly-annotated scene sketches." In Proceedings of the European Conference on Computer Vision (ECCV), pp. 421-436. 2018.).
(7) SceneSketcher 1.0 (reference: Liu, Fang, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, and Hongan Wang. "SceneSketcher: Fine-Grained Image Retrieval with Scene Sketches." In European Conference on Computer Vision, pp. 718-734. Springer, Cham, 2020.).
Experiments show that the sketch-based image retrieval method provided by the invention has excellent performance.
The method of the present invention has been described in detail, but it is apparent that the specific embodiment of the present invention is not limited thereto. It will be apparent to those skilled in the art that various obvious changes can be made therein without departing from the spirit of the process of the invention and the scope of the claims.

Claims (10)

1. A fine-grained scene-level sketch-based image retrieval method is characterized by comprising the following steps:
1) respectively constructing a graph structure for a scene sketch and a scene image to be retrieved, wherein each node in the graph structure represents an object category in the scene, and the edges in the graph structure represent the relationships between object categories in the scene;
2) extracting graph structure characteristics of a scene sketch and a scene image according to the constructed graph structure by using an attention-based adaptive graph convolution neural network;
3) in the training stage, processing the training image data through steps 1) and 2) to extract the graph structure features of the scene sketch and the scene image, computing the Euclidean distances of positive and negative samples from the extracted graph structure features using a triplet network, and calculating the triplet loss from the computed Euclidean distances; according to the triplet loss, updating the adaptive graph convolutional neural network through back propagation and optimizing the network parameters to obtain the trained adaptive graph convolutional neural network;
4) in the testing stage, the scene sketch and the scene image to be retrieved are processed in the steps 1) and 2), the graph structure characteristics of the scene sketch and the scene image are extracted by using the trained self-adaptive graph convolution neural network, and the scene sketch and the scene image to be retrieved are matched according to the Euclidean distance of the graph structure characteristics of the scene sketch and the scene image to be retrieved, so that an image retrieval result is obtained.
2. The method of claim 1, wherein constructing the graph structure comprises constructing nodes and constructing edges, wherein the constructing of the nodes comprises: clustering the instances in the scene according to object category, and obtaining node features from the clustered instances, wherein the node features comprise the category label, the visual features of the instances, and the position information of the instances.
3. The method of claim 2, wherein the step of obtaining node characteristics comprises:
1) detecting the scene sketch or scene image through the trained object detection network YOLO-v4 to obtain the position and category information of each instance in the scene;
2) extracting the visual features of each instance in each category of the scene through the visual feature extraction network Inception-V3, and concatenating the visual features of each instance with its position information to form instance features;
3) and performing feature fusion on all the example features in each category through a convolutional neural network to obtain each node feature.
4. The method of claim 3, wherein YOLO-v4 is trained on the COCO-Stuff database and Inception-V3 is trained on the ImageNet dataset.
5. The method of claim 2, wherein the location information of the instance is represented by a four-dimensional vector whose four-dimensional values represent coordinate points of the upper left corner and the lower right corner of the rectangular bounding box of the instance, respectively.
6. The method of claim 3, wherein the visual features are 2048-dimensional vectors, and the visual features of each instance are connected with the location information to obtain 2052-dimensional instance feature vectors.
7. The method of claim 2, wherein the edges are constructed by: calculating the Euclidean distance between two nodes, normalizing the Euclidean distance, and constructing an edge with the normalized Euclidean distance as its weight.
8. The method of claim 2, wherein three different graph structures are simulated by constructing three different adjacency matrices, comprising the steps of:
1) calculating Euclidean distances between the center positions of the examples in each category, and normalizing to obtain an edge A1 of the graph structure;
2) extracting a word vector for each category label via the Word2Vec word embedding algorithm, and computing the cosine distances between the category label word vectors as the edges A2 of the graph structure;
3) introducing a learnable adjacency matrix as an edge A3 of the graph structure and randomly initializing the learnable adjacency matrix;
4) three adjacency matrixes are obtained according to the edges A1, A2 and A3, the three adjacency matrixes are added to obtain an updated adjacency matrix of the graph convolution neural network, and the graph structure is represented by the updated adjacency matrix.
9. The method of claim 8, wherein the graph convolutional neural network extracts features from the graph structure through an affine function; the graph convolutional neural network has a plurality of network layers, each layer from the second layer taking as input the output of the previous layer and the adjacency matrix of the graph structure.
10. A fine-grained scene-level sketch-based image retrieval system for implementing the method of any one of claims 1-9, comprising:
the target detection network is used for detecting a scene sketch and a scene image to obtain the position and the category information of each instance in the scene;
the visual feature extraction network is used for extracting the visual features of the examples in each category of the scene and connecting the visual features of the examples with the position information to form example features;
the single-layer convolutional neural network is used for fusing all the example characteristics in each category to obtain each node characteristic;
the adaptive graph convolutional neural network is used for extracting the graph structure features of the scene sketch and the scene image according to the constructed graph structures, calculating the Euclidean distance between the two graph structure features, and matching the scene sketch with the scene images to be retrieved;
the triplet network is used, in the training stage, for computing the Euclidean distances of positive and negative samples from the extracted graph structure features and calculating the triplet loss from the computed Euclidean distances, and for updating the adaptive graph convolutional neural network through back propagation according to the triplet loss to optimize the network parameters.
CN202111004545.6A, priority date 2021-05-08, filed 2021-08-30: Fine-grained scene level sketch-based image retrieval method and system (Pending)

Applications Claiming Priority (2)

CN202110497855, priority date 2021-05-08
CN202110497855X, priority date 2021-05-08

Publications (1)

CN113868448A, published 2021-12-31

Family

ID=78988863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111004545.6A Pending CN113868448A (en) 2021-05-08 2021-08-30 Fine-grained scene level sketch-based image retrieval method and system

Country Status (1)

Country Link
CN (1) CN113868448A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067215A (en) * 2022-01-17 2022-02-18 东华理工大学南昌校区 Remote sensing image retrieval method based on node attention machine mapping neural network
CN114494499A (en) * 2022-01-26 2022-05-13 电子科技大学 Sketch coloring method based on attention mechanism
CN114485666A (en) * 2022-01-10 2022-05-13 北京科技大学顺德研究生院 Blind person aided navigation method and device based on object association relationship cognitive inference
CN115878833A (en) * 2023-02-20 2023-03-31 中山大学 Appearance patent image retrieval method and system based on hand-drawn sketch semantics


Similar Documents

Publication Publication Date Title
Bansal et al. Zero-shot object detection
CN108038122B (en) Trademark image retrieval method
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN113868448A (en) Fine-grained scene level sketch-based image retrieval method and system
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
Zhai et al. One-shot object affordance detection in the wild
CN108170823B (en) Hand-drawn interactive three-dimensional model retrieval method based on high-level semantic attribute understanding
Qian et al. Language-aware weak supervision for salient object detection
Yun et al. Instance GNN: a learning framework for joint symbol segmentation and recognition in online handwritten diagrams
Xu et al. Scene graph inference via multi-scale context modeling
CN110147841A (en) The fine grit classification method for being detected and being divided based on Weakly supervised and unsupervised component
Xu et al. A page object detection method based on mask R-CNN
Zhang et al. 3-D deconvolutional networks for the unsupervised representation learning of human motions
Zhu et al. 2D freehand sketch labeling using CNN and CRF
Wang et al. KTN: Knowledge transfer network for learning multiperson 2D-3D correspondences
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
Astolfi et al. Syntactic pattern recognition in computer vision: A systematic review
Wu et al. MPCT: Multiscale point cloud transformer with a residual network
Henderson Analysis of engineering drawings and raster map images
Oluwasanmi et al. Attentively conditioned generative adversarial network for semantic segmentation
CN116935100A (en) Multi-label image classification method based on feature fusion and self-attention mechanism
CN113516118B (en) Multi-mode cultural resource processing method for joint embedding of images and texts
CN115359486A (en) Method and system for determining custom information in document image
Li et al. Non-Co-Occurrence Enhanced Multi-Label Cross-Modal Hashing Retrieval Based on Graph Convolutional Network
CN112580614A (en) Hand-drawn sketch identification method based on attention mechanism

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination