CN113868448A - Fine-grained scene-level sketch-based image retrieval method and system

Info

Publication number: CN113868448A
Application number: CN202111004545.6A
Authority: CN (China)
Prior art keywords: scene, sketch, image, graph, graph structure
Legal status: Pending (the legal status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: Cuixia Ma (马翠霞), Fang Liu (刘舫), Keqi Chen (陈科圻), Xiaoming Deng (邓小明), Hongan Wang (王宏安)
Current/Original Assignee: Institute of Software of CAS
Application filed by Institute of Software of CAS
Publication of CN113868448A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval of still image data
    • G06F 16/53 - Querying
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods


Abstract

The invention discloses a fine-grained scene-level sketch-based image retrieval method and system in the field of computer vision. It addresses the problem that most existing sketch-based image retrieval methods target single-object, category-level retrieval, and expands the application of sketch retrieval in everyday life. An attention mechanism is introduced to build an adaptive graph convolutional neural network, and a triplet network is used to match the sketch with the images to be retrieved. The scene sketch and the images to be retrieved are modeled at three levels of the scene (global layout, category level, and instance level), and the information of each level is integrated to achieve accurate matching between the input scene sketch and the images.

Description

Fine-grained scene level sketch-based image retrieval method and system
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a fine-grained scene-level sketch-based image retrieval method and system.
Background
Visual media applications based on sketch interaction have long been a research hotspot in the fields of human-computer interaction, computer vision, and multimedia, and how to optimize the basic processing of sketch data and improve the efficiency of sketch-based visual media applications is a key research problem. Sketch interaction is widely applied in many aspects of life and work, including drawing, note taking, document annotation, web user interface (UI) and concept design in the Internet industry, and animation and film production in the film and animation industry. In recent years, research and applications related to sketch interaction have attracted wide attention in both industry and academia, one important reason being the explosive development of touch-screen hardware devices (e.g., Microsoft Surface touch notebooks, the Apple Pencil, smart phones, and tablet computers). In the artificial intelligence era, on the one hand, users can acquire sketch data more conveniently; on the other hand, the performance of deep-learning-based sketch data algorithms keeps improving. Applications and tasks based on sketch interaction have accordingly developed at an unprecedented pace. The main sketch interaction tasks and their development trends include:
(1) In terms of sketch interaction tasks and sketch data processing, sketch processing technologies such as sketch recognition, sketch simplification, and sketch colorization have been widely studied. Sketch-a-Net builds a sketch recognition model on a convolutional neural network and achieves strong sketch recognition performance (reference: Yu, Qian, Yongxin Yang, Feng Liu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. "Sketch-a-Net: A deep neural network that beats humans." International Journal of Computer Vision 122, no. 3 (2017): 411-425.). Learning to Simplify can automatically simplify and optimize a user-drawn sketch in a data-driven way, including removing messy redundant strokes and beautifying lines (reference: Simo-Serra, Edgar, Satoshi Iizuka, Kazuma Sasaki, and Hiroshi Ishikawa. "Learning to simplify: fully convolutional networks for rough sketch cleanup." ACM Transactions on Graphics (TOG) 35, no. 4 (2016): 1-11.).
(2) New sketch interaction tasks have also been proposed, such as sketch-based image generation, sketch-based model generation, sketch abstraction based on reinforcement learning, sketch-based image editing, and sketch segmentation based on graph convolutional networks. Liu et al. propose SceneSketcher, an image retrieval method based on scene sketches that can retrieve similar scene images from a user-drawn sketch (reference: Liu, Fang, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, and Hongan Wang. "SceneSketcher: Fine-Grained Image Retrieval with Scene Sketches." In European Conference on Computer Vision, pp. 718-734. Springer, Cham, 2020.).
(3) Driven by the requirements of practical applications, sketch interaction research has moved toward fine-grained methods. In particular, relative to holistic tasks (e.g., sketch recognition), several fine-grained sketch interaction tasks have been proposed in recent years. At present, most sketch-based image retrieval techniques are built on the premise of instance-level and category-level retrieval, namely: the input sketch and the image objects to be retrieved are both single objects, and a retrieval result is considered correct when the object in the result image belongs to the same category as the input sketch object. Traditional instance-level and category-level sketch-based image retrieval methods only focus on retrieving images of the same category, and typically ignore the shape, pose, and other fine-grained attributes of the retrieved images. Compared with category-level sketch-based retrieval, text retrieval can express the same category semantics with a simpler query, which is why traditional sketch-based image retrieval has not been widely adopted in practice.
The high abstraction, intuitiveness, and conciseness peculiar to sketches have made them widely used in human-computer interaction, computer vision, multimedia, computer graphics, and other fields. From the 1960s to the present, with the continuous improvement of data processing technology, sketch-related research and applications have been continuously refined. Among them, sketch-based image retrieval (SBIR) is one of the most widely used and representative sketch applications, and in the intelligent era it faces new developments and challenges.
Disclosure of Invention
The invention aims to provide a fine-grained scene-level sketch-based image retrieval method and system. An attention mechanism is introduced to build an adaptive graph convolutional neural network; the scene sketch and the images to be retrieved are modeled at three levels of the scene (global layout, category level, and instance level), and the information of each level is integrated to achieve accurate matching between the input scene sketch and the images.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a fine-grained scene-level sketch-based image retrieval method comprises the following steps:
1) respectively constructing a graph structure (graph) for the scene sketch and the scene images to be retrieved, wherein each node in the graph structure represents an object category in the scene, and the edges in the graph structure represent the relationships between object categories in the scene;
2) extracting graph structure characteristics of a scene sketch and a scene image according to the constructed graph structure by using an attention-based adaptive graph convolution neural network;
3) in the training stage, processing the training image data through steps 1) and 2) to extract the graph structure features of the scene sketch and the scene image; using a triplet network to compute the Euclidean distances of positive and negative samples from the extracted graph structure features, and calculating the triplet loss from these distances; according to the triplet loss, updating the adaptive graph convolutional neural network through back propagation and optimizing the network parameters to obtain the trained adaptive graph convolutional neural network;
4) in the testing stage, processing the scene sketch and the scene images to be retrieved through steps 1) and 2); extracting the graph structure features of the scene sketch and the scene images using the trained adaptive graph convolutional neural network, and matching the scene sketch against the scene images to be retrieved according to the Euclidean distance between their graph structure features, yielding the image retrieval result (a minimal matching sketch follows below).
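As an illustration of the matching in step 4), the following is a minimal sketch assuming each scene has already been encoded into a feature vector by the trained network; the helper name `retrieve` is hypothetical and not part of the patent:

```python
import torch

def retrieve(sketch_feat, image_feats, top_k=5):
    """Ranks candidate scene images by Euclidean distance to the sketch feature.

    sketch_feat: (C,) graph structure feature of the query scene sketch.
    image_feats: (M, C) graph structure features of M candidate scene images.
    Returns the indices of the top_k closest images.
    """
    dists = torch.cdist(sketch_feat.unsqueeze(0), image_feats).squeeze(0)  # (M,)
    return torch.argsort(dists)[:top_k]
```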
Further, the construction of the graph structure comprises the construction of nodes and the construction of edges, wherein the construction method of the nodes is as follows: clustering the instances in the scene according to object category, and obtaining node features from the clustered instances, the node features comprising the category label, the visual features of the instances, and the position information of the instances.
Further, the step of obtaining the node characteristics comprises:
1) detecting the scene sketch or scene image with the trained object detection network YOLO-v4 to obtain the position and category information of each instance in the scene;
2) extracting the visual features of each instance in each category of the scene with the visual feature extraction network Inception-V3, and concatenating the visual features of each instance with its position information to form instance features;
3) fusing all the instance features within each category through a convolutional neural network to obtain each node feature.
Further, YOLO-v4 is trained on the COCO-Stuff database, and Inception-V3 is trained on the ImageNet dataset.
Further, the position information of an instance is represented by a four-dimensional vector whose values are the coordinates of the upper-left and lower-right corners of the instance's rectangular bounding box.
Further, the visual features are 2048-dimensional vectors, and a 2052-dimensional instance feature vector is obtained by concatenating the visual features of each instance with its position information.
Further, the construction method of the edges is as follows: calculating the Euclidean distance between two nodes, normalizing the Euclidean distance, and constructing an edge with the normalized Euclidean distance as its weight.
Further, three different graph structures are modeled by constructing three different adjacency matrices, comprising the following steps:
1) calculating the Euclidean distances between the instance center positions of each category and normalizing them to obtain the edges A1 of the graph structure;
2) extracting a word vector for each category label via the Word2Vec word embedding algorithm, and computing the cosine distances between the category label word vectors as the edges A2 of the graph structure;
3) introducing a learnable adjacency matrix as the edges A3 of the graph structure and initializing it randomly;
4) obtaining three adjacency matrices from the edges A1, A2, and A3, adding them to obtain the updated adjacency matrix of the graph convolutional neural network, and representing the graph structure with the updated adjacency matrix.
Further, the graph convolution neural network extracts features from the graph structure through an affine function; the graph convolutional neural network has a plurality of network layers, each layer from the second layer taking as input the output of the previous layer and the adjacency matrix of the graph structure.
A fine-grained scene-level sketch-based image retrieval system, comprising:
the target detection network is used for detecting a scene sketch and a scene image to obtain the position and the category information of each instance in the scene;
the visual feature extraction network is used for extracting the visual features of the examples in each category of the scene and connecting the visual features of the examples with the position information to form example features;
the single-layer convolutional neural network is used for fusing all the example characteristics in each category to obtain each node characteristic;
the adaptive graph convolutional neural network is used for extracting the graph structure features of the scene sketch and the scene image according to the constructed graph structures, calculating the Euclidean distance between the two graph structure features, and matching the scene sketch with the scene images to be retrieved;
the triplet network is used, in the training stage, for computing the Euclidean distances of positive and negative samples from the extracted graph structure features and calculating the triplet loss from these distances; according to the triplet loss, back propagation updates the adaptive graph convolutional neural network and optimizes the network parameters.
Compared with the prior art, the invention has the beneficial effects that:
1. the method improves the traditional sketch-based image retrieval task, focuses on fine-grained and scene-level image retrieval, can expand the application context of sketch retrieval, and promotes the application of sketch retrieval in actual life.
2. The invention provides a method for representing scene sketch and images by using a graph structure, explicitly simulating object types in the scene by using nodes of the graph structure, and simulating the spatial relationship and semantic relationship between the types in the scene by using the sides of the graph structure, so that a multi-level scene sketch-image retrieval and matching model of global layout-type level-example level can be established.
3. The attention-based adaptive graph convolution module increases the flexibility of the model in learning graph features and object relationships, and allows the visual feature extraction networks and the graph convolutional neural network module to be trained simultaneously in an end-to-end manner, optimizing the retrieval performance of the model.
Drawings
Fig. 1 is a schematic diagram of a fine-grained scene-level sketch-based image retrieval network structure according to the present invention.
FIG. 2 is a diagram illustrating the structure of a graph node according to the present invention.
FIG. 3 is a diagram of an adaptive graph convolution network module according to the present invention.
Fig. 4 is an example of image retrieval according to the present invention.
FIG. 5 is a comparison of the search results of the present invention and SceneSketcher 1.0.
Detailed Description
The process of the invention is described in further detail below so that those skilled in the art can better understand the invention, but this is not to be construed as limiting the invention.
The embodiment discloses a fine-grained scene-level sketch-based image retrieval method, which mainly comprises: a graph structure construction method for scene sketches and scene images; an adaptive graph convolutional neural network built by introducing an attention mechanism; and the joint use of the graph structure's spatial position information, semantic relations, and a learnable adjacency matrix to update the adaptive graph convolutional neural network.
Fig. 1 is a schematic diagram of the fine-grained scene-level sketch-based image retrieval network structure provided by the invention, which comprises: (1) the inputs, namely a scene sketch, a positive example image matching the scene sketch, and a negative example image not matching the scene sketch; (2) an adaptive graph convolution feature extraction network; (3) a triplet network.
The method mainly comprises the following steps:
1. Scene graph construction process
The graph structure is represented as G = (N, E), where N = {n_i} is the set of nodes of the graph structure, and node n_i represents the object category whose class label in the scene is c_i; E = {e_ij} is the set of edges, and e_ij = (n_i, n_j) is the edge connecting nodes n_i and n_j. The set of categories in the scene is denoted C = {c_i}.
This embodiment makes the nodes represent the object categories of the scene, and uses the visual features and position information of each instance to construct the feature of node n_i in the graph structure, as follows:
1) 16379 images are selected from the COCO-Stuff database (reference: Caesar H, Uijlings J, Ferrari V. COCO-Stuff: Thing and Stuff Classes in Context. 2016.) to train the YOLO-v4 object detection network (reference: Bochkovskiy A, Wang C Y, Liao H. YOLOv4: Optimal Speed and Accuracy of Object Detection. 2020.), and the scene sketch and the image to be retrieved are processed by the detection network to obtain the position and category of each object instance in each scene. Specifically, given an object category c_i, {o_ij} is the set of instances whose category label in the scene is c_i; the position information p_ij of each instance o_ij is obtained through the YOLO-v4 detection network.
2) A visual feature extraction network pre-trained on the ImageNet dataset (Inception-V3, reference: Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 2818-2826.) is used to extract the visual feature v_ij of each instance o_ij; the visual feature is a 2048-dimensional vector.
3) The position information p_ij of instance o_ij is represented as a 4-dimensional vector whose four values are the coordinates of the upper-left and lower-right corners of the instance's rectangular bounding box.
4) The visual feature v_ij and position information p_ij of instance o_ij are concatenated to form a 2052-dimensional instance feature vector.
5) The instance feature vectors of all instances o_ij of category c_i are fused through a single-layer convolutional neural network to obtain the feature vector x_i of graph node n_i, representing the feature information of object category c_i in the scene (see the sketch below).
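The node construction above can be sketched as follows. This is a minimal PyTorch example under stated assumptions: the per-instance visual features and boxes are taken as already computed by the detector and backbone, and the pooling after the single-layer CNN is our guess, since the text does not spell it out:

```python
import torch
import torch.nn as nn

class CategoryNode(nn.Module):
    """Builds the feature x_i of one graph node (object category c_i)."""
    def __init__(self, vis_dim=2048, pos_dim=4):
        super().__init__()
        # Single-layer convolutional network fusing the instance features.
        self.fuse = nn.Conv1d(vis_dim + pos_dim, vis_dim + pos_dim, kernel_size=1)

    def forward(self, vis_feats, boxes):
        # vis_feats: (k, 2048) Inception-V3 features of the k instances o_ij;
        # boxes:     (k, 4) bounding-box corners p_ij from YOLO-v4.
        inst = torch.cat([vis_feats, boxes], dim=1)   # (k, 2052) instance features
        fused = self.fuse(inst.t().unsqueeze(0))      # (1, 2052, k)
        return fused.mean(dim=2).squeeze(0)           # (2052,) node feature x_i
```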
For the construction of the edges in the graph structure, for two nodes n_i and n_j, the weight A_ij of edge e_ij = (n_i, n_j) is defined as

A_ij = 1 - D_ij,

where D_ij = ||x_j - x_i||_2 is the (normalized) Euclidean distance between nodes n_i and n_j.
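A minimal sketch of this edge construction follows; the exact normalization is not specified in the text, so dividing by the maximum pairwise distance is one plausible choice:

```python
import torch

def edge_weights(node_feats):
    """A_ij = 1 - D_ij, with D_ij the normalized Euclidean distance."""
    # node_feats: (n, d) matrix whose rows are the node features x_i.
    dist = torch.cdist(node_feats, node_feats, p=2)   # D_ij = ||x_j - x_i||_2
    dist = dist / (dist.max() + 1e-8)                 # normalize into [0, 1]
    return 1.0 - dist                                 # closer nodes get heavier edges
```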
2. Graph convolution neural network (GCN)
The GCN extracts features from the graph G = (N, E) through an affine function f(·,·). The input of each GCN layer is the output of the previous layer together with the adjacency matrix A = {A_ij} of the graph structure. The propagation of the GCN base network at layer l can be written as

H^(0) = [x_1, x_2, ..., x_n]^T,
H^(l) = f(H^(l-1), A),

where 1 ≤ l ≤ L, L is the number of GCN layers, x_i are the node feature vectors, H^(l) is the output of layer l, n is the total number of nodes, and A is the adjacency matrix.
Further, the invention utilizes the optimized GCN propagation rule (reference: Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).), under which the function f(·,·) can be expressed as

f(H^(l-1), A) = σ( D̂^(-1/2) Â D̂^(-1/2) H^(l-1) W^(l) ),

where σ(·) is the leaky_relu activation function, Â = A + I with I the identity matrix, D̂ is the (diagonal) degree matrix of the nodes of Â, and W^(l) is the weight matrix to be learned.
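This propagation rule can be written as the following minimal PyTorch layer; a sketch assuming a dense adjacency matrix, with names of our choosing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One layer computing sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)   # W^(l)

    def forward(self, H, A):
        # H: (n, in_dim) node features; A: (n, n) adjacency matrix.
        A_hat = A + torch.eye(A.size(0), device=A.device)      # add self-loops (A + I)
        deg = A_hat.sum(dim=1).clamp(min=1e-8)                 # node degrees of A_hat
        d_inv_sqrt = torch.diag(deg.pow(-0.5))                 # D^-1/2
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt               # symmetric normalization
        return F.leaky_relu(self.weight(A_norm @ H))           # sigma = leaky_relu
```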
The GCN is applied to the graph structures of the scene sketch and the scene image respectively, yielding the corresponding graph structure features G_S and G_I.
3. Adaptive Graph Convolutional Neural Network Module (Adaptive Graph Convolutional Module)
The existing sketch retrieval technique SceneSketcher (reference: Liu, Fang, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, and Hongan Wang. "SceneSketcher: Fine-Grained Image Retrieval with Scene Sketches." In European Conference on Computer Vision, pp. 718-734. Springer, Cham, 2020.) uses a single graph structure to model the semantic and spatial relationships between objects, and does not take into account the strong dependencies between object categories in a scene. An example of such a dependency, based on statistics of the image content in the database, is that the probability of "dog" and "cat" occurring together is generally greater than that of "dog" and "airplane" occurring together. The invention therefore proposes an adaptive graph convolutional neural network module for the fine-grained scene-level SBIR task, based on the spatial attention mechanism in 2s-AGCN (reference: Shi L, Zhang Y, Cheng J, et al. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. 2018.). The Inception-V3 network used for instance visual feature extraction during node construction has the same structure as the visual feature extraction network used in the adaptive graph convolutional neural network module, but their parameters are updated separately during model training. Throughout the training of the retrieval framework, the parameters of each visual feature extraction network and of the graph convolutional neural network module are updated simultaneously, constructing an end-to-end retrieval model framework.
The method models three different graph structures by constructing three different adjacency matrices:
1) The fixed adjacency matrix A_1 represents the category-level features and spatial layout of the scene sketch. The invention uses the edges A_ij of the graph structure built in the scene graph construction process as A_1.
2) To better model the semantic associations between object categories, the invention designs a semantic graph to model the topological structure of the category label space. Each category label c_i is encoded as a 300-dimensional word vector w_i by the Word2Vec word embedding algorithm (reference implementation: https://code...), and the cosine distance between each pair of category label word vectors is computed to measure the correlation between categories, serving as the edges A_2 of the graph structure.
3) A_3 is a learnable adjacency matrix. Whereas A_1 and A_2 are fixed after initialization, the adjacency matrix A_3 is continuously updated throughout the training of the network model framework. Through this data-driven learning approach, the model can learn a graph structure particularly suited to a specific retrieval task and retrieval dataset. To keep A_3 flexible, the invention initializes it randomly.
4) The three adjacency matrices are added, integrating the graph structures of the different levels, to serve as the updated adjacency matrix of the graph convolutional neural network (a code sketch follows below).
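A sketch of the three adjacency matrices and their sum, assuming the category centers and 300-dimensional Word2Vec label vectors are precomputed; following the text literally, A_2 is the cosine distance between label vectors, though cosine similarity would be the more common choice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_adjacency(centers):
    """A_1: 1 - normalized Euclidean distance between category centers."""
    d = torch.cdist(centers, centers, p=2)
    return 1.0 - d / (d.max() + 1e-8)

def semantic_adjacency(word_vecs):
    """A_2: cosine distance between 300-d Word2Vec label embeddings."""
    w = F.normalize(word_vecs, dim=1)
    return 1.0 - w @ w.t()

class AdaptiveAdjacency(nn.Module):
    """A = A_1 + A_2 + A_3; A_3 is learned end-to-end, A_1 and A_2 stay fixed."""
    def __init__(self, num_nodes):
        super().__init__()
        # Random initialization keeps A_3 flexible, as described in the text.
        self.A3 = nn.Parameter(torch.randn(num_nodes, num_nodes) * 0.01)

    def forward(self, A1, A2):
        return A1 + A2 + self.A3
```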
Fig. 3 shows the structure of the adaptive graph convolution neural network module of the present invention, and the processing flow mainly includes the following steps:
1) A scene sketch is input; the graph-structure node construction module of the adaptive graph convolutional network first produces an N × (f_v + 5)-dimensional graph structure feature f_in, where N denotes the number of nodes and f_v denotes the output dimensionality of the visual feature extraction network. Experiments show that with f_v = 2048 the proposed network model achieves good results in most cases.
2) Meanwhile, for the input scene sketch, a 1 × N global feature vector is extracted using the visual feature extraction network.
3) The input f_in is then encoded into an N × C graph structure feature matrix by the graph convolutional neural network using the adjacency matrix constructed above.
4) Finally, the 1 × N global feature vector is multiplied by the N × C graph structure feature matrix to obtain a 1 × C scene sketch feature, which is used for the final feature matching and image retrieval between the scene sketch and the image (see the sketch below).
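Step 4) amounts to a single matrix product, sketched below with names of our choosing; the 1 × N global vector acts as attention weights over the N nodes:

```python
import torch

def scene_feature(global_vec, graph_feats):
    """Fuses the global and graph-level features into one retrieval feature.

    global_vec:  (1, N) global feature from the visual extraction network.
    graph_feats: (N, C) graph structure features from the GCN.
    Returns the (1, C) scene sketch feature used for matching.
    """
    return global_vec @ graph_feats
```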
5. Loss function
The invention realizes feature matching and image retrieval between the scene sketch and the image using a triplet network. The rationale of a triplet network is to pull features with the same label closer together and push features with different labels farther apart. The input of the triplet network is (S, I+, I-), where S denotes a scene sketch, I+ is the scene image corresponding to the scene sketch, and I- is an image that does not match the input sketch. The loss function of the triplet network can be expressed as

L_tri = max( d(S, I+) - d(S, I-) + m, 0 ),

where d(·,·) is a distance function of the feature space used to measure distances between graph structure features; the invention uses the Euclidean distance as d(·,·), and m is a margin threshold. In the experiments of the invention, the threshold m is set to 0.4, with which the model obtains stable performance in most settings.

For the input scene sketch feature G_S, the distances to the positive-sample scene image feature G_{I+} and the negative-sample scene image feature G_{I-} are computed; that is, for the input triplet (S, I+, I-), the Euclidean distance is used to obtain d(S, I+) and d(S, I-) respectively.
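A minimal sketch of the triplet loss above; PyTorch's built-in nn.TripletMarginLoss(margin=0.4) computes the same quantity:

```python
import torch
import torch.nn.functional as F

def triplet_loss(g_s, g_pos, g_neg, margin=0.4):
    """L_tri = max(d(S, I+) - d(S, I-) + m, 0), averaged over the batch.

    g_s, g_pos, g_neg: (B, C) features of the sketch S, the matching image I+
    and a non-matching image I-; d is the Euclidean distance, m = 0.4.
    """
    d_pos = F.pairwise_distance(g_s, g_pos, p=2)   # d(S, I+)
    d_neg = F.pairwise_distance(g_s, g_neg, p=2)   # d(S, I-)
    return F.relu(d_pos - d_neg + margin).mean()
```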
6. Experimental part
Experimental data: the invention involves three experimental datasets: (1) based on the SketchyCOCO scene sketch dataset, sketches and images containing more than two objects are selected to construct the Scene Sketch Database, comprising 1225 "sketch-image" matching pairs in total (1015 training pairs and 210 test pairs) covering 14 object categories (bicycle, car, motorcycle, airplane, traffic light, fire hydrant, cat, dog, horse, sheep, cow, etc.); (2) based on the SketchyCOCO scene sketch dataset, 5000 images of similar categories are selected from the COCO-Stuff dataset as a supplement to the test set, forming the Extended Scene Sketch Database, i.e., 1015 training pairs and a test set comprising 210 test sketches and 5210 test images; (3) the "sketch-image" matching pairs whose semantic IoU is greater than 0.5 are removed from the SketchyScene dataset to form the SketchyScene Database, comprising 2724 "sketch-image" matching pairs in total (2472 training pairs and 252 test pairs).
The method of the invention is compared with seven classical sketch-based image retrieval methods; the results are shown in Table 1 below:
TABLE 1: comparison with seven classical sketch-based image retrieval methods (the table is rendered as an image in the original document)
(1) HOG + BoW + RankSVM (reference: R. Hu and J. Collomosse, "A performance evaluation of gradient field HOG descriptor for sketch based image retrieval," Computer Vision and Image Understanding, vol. 117, no. 7, pp. 790-806, 2013.).
(2) Dense HOG + RankSVM (reference: Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy, "Sketch me that shoe," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 799-807.).
(3) Sketch-a-Net + RankSVM (reference: Yu, Qian, Yongxin Yang, Feng Liu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. "Sketch-a-Net: A deep neural network that beats humans." International Journal of Computer Vision 122, no. 3 (2017): 411-425.).
(4) Sketch me that shoe (reference: Yu, Qian, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, and Chen-Change Loy. "Sketch me that shoe." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 799-807. 2016.).
(5) DSSA (reference: Song, Jifei, Qian Yu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. "Deep spatial-semantic attention for fine-grained sketch-based image retrieval." In Proceedings of the IEEE International Conference on Computer Vision, pp. 5551-5560. 2017.).
(6) SketchyScene (reference: Zou, Changqing, Qian Yu, Ruofei Du, Haoran Mo, Yi-Zhe Song, Tao Xiang, Chengying Gao, Baoquan Chen, and Hao Zhang. "SketchyScene: Richly-annotated scene sketches." In Proceedings of the European Conference on Computer Vision (ECCV), pp. 421-436. 2018.).
(7) SceneSketcher 1.0 (reference: Liu, Fang, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, and Hongan Wang. "SceneSketcher: Fine-Grained Image Retrieval with Scene Sketches." In European Conference on Computer Vision, pp. 718-734. Springer, Cham, 2020.).
Experiments show that the sketch-based image retrieval method provided by the invention has excellent performance.
The method of the present invention has been described in detail, but it is apparent that the specific embodiment of the present invention is not limited thereto. It will be apparent to those skilled in the art that various obvious changes can be made therein without departing from the spirit of the process of the invention and the scope of the claims.

Claims (10)

1. A fine-grained scene-level sketch-based image retrieval method is characterized by comprising the following steps:
1) respectively constructing a graph structure for a scene sketch and a scene image to be retrieved, wherein each node in the graph structure represents an object category in the scene, and the edges in the graph structure represent the relationships between object categories in the scene;
2) extracting graph structure characteristics of a scene sketch and a scene image according to the constructed graph structure by using an attention-based adaptive graph convolution neural network;
3) in the training stage, processing the training image data through steps 1) and 2) to extract the graph structure features of the scene sketch and the scene image, computing the Euclidean distances of positive and negative samples from the extracted graph structure features using a triplet network, and calculating the triplet loss from the computed Euclidean distances; according to the triplet loss, updating the adaptive graph convolutional neural network through back propagation and optimizing the network parameters to obtain the trained adaptive graph convolutional neural network;
4) in the testing stage, the scene sketch and the scene image to be retrieved are processed in the steps 1) and 2), the graph structure characteristics of the scene sketch and the scene image are extracted by using the trained self-adaptive graph convolution neural network, and the scene sketch and the scene image to be retrieved are matched according to the Euclidean distance of the graph structure characteristics of the scene sketch and the scene image to be retrieved, so that an image retrieval result is obtained.
2. The method of claim 1, wherein constructing the graph structure comprises constructing nodes and constructing edges, wherein the constructing of the nodes comprises: clustering the instances in the scene according to object category, and obtaining node features from the clustered instances, wherein the node features comprise the category label, the visual features of the instances, and the position information of the instances.
3. The method of claim 2, wherein the step of obtaining node characteristics comprises:
1) detecting the scene sketch or scene image through the trained object detection network YOLO-v4 to obtain the position and category information of each instance in the scene;
2) extracting the visual features of each instance in each category of the scene through the visual feature extraction network Inception-V3, and concatenating the visual features of each instance with its position information to form instance features;
3) and performing feature fusion on all the example features in each category through a convolutional neural network to obtain each node feature.
4. The method of claim 3, wherein YOLO-v4 is trained on the COCO-Stuff database and Inception-V3 is trained on the ImageNet dataset.
5. The method of claim 2, wherein the location information of the instance is represented by a four-dimensional vector whose four-dimensional values represent coordinate points of the upper left corner and the lower right corner of the rectangular bounding box of the instance, respectively.
6. The method of claim 3, wherein the visual features are 2048-dimensional vectors, and the visual features of each instance are connected with the location information to obtain 2052-dimensional instance feature vectors.
7. The method of claim 2, wherein the edges are constructed by: calculating the Euclidean distance between two nodes, normalizing the Euclidean distance, and constructing an edge with the normalized Euclidean distance as its weight.
8. The method of claim 2, wherein three different graph structures are simulated by constructing three different adjacency matrices, comprising the steps of:
1) calculating Euclidean distances between the center positions of the examples in each category, and normalizing to obtain an edge A1 of the graph structure;
2) extracting a word vector for each category label via the Word2Vec word embedding algorithm, and computing the cosine distances between the category label word vectors as the edges A2 of the graph structure;
3) introducing a learnable adjacency matrix as an edge A3 of the graph structure and randomly initializing the learnable adjacency matrix;
4) three adjacency matrixes are obtained according to the edges A1, A2 and A3, the three adjacency matrixes are added to obtain an updated adjacency matrix of the graph convolution neural network, and the graph structure is represented by the updated adjacency matrix.
9. The method of claim 8, wherein the graph convolutional neural network extracts features from the graph structure through an affine function; the graph convolutional neural network has a plurality of network layers, each layer from the second layer taking as input the output of the previous layer and the adjacency matrix of the graph structure.
10. A fine-grained scene-level sketch-based image retrieval system for implementing the method of any one of claims 1-9, comprising:
the target detection network is used for detecting a scene sketch and a scene image to obtain the position and the category information of each instance in the scene;
the visual feature extraction network is used for extracting the visual features of the examples in each category of the scene and connecting the visual features of the examples with the position information to form example features;
the single-layer convolutional neural network is used for fusing all the example characteristics in each category to obtain each node characteristic;
the adaptive graph convolutional neural network is used for extracting the graph structure features of the scene sketch and the scene image according to the constructed graph structures, calculating the Euclidean distance between the two graph structure features, and matching the scene sketch with the scene images to be retrieved;
the triplet network is used, in the training stage, for computing the Euclidean distances of positive and negative samples from the extracted graph structure features and calculating the triplet loss from the computed Euclidean distances, and for updating the adaptive graph convolutional neural network through back propagation according to the triplet loss to optimize the network parameters.
CN202111004545.6A, priority date 2021-05-08, filed 2021-08-30: Fine-grained scene level sketch-based image retrieval method and system (Pending)

Applications Claiming Priority (2)

CN202110497855, priority date 2021-05-08
CN202110497855X, priority date 2021-05-08

Publications (1)

CN113868448A, published 2021-12-31

Family

ID=78988863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111004545.6A Pending CN113868448A (en) 2021-05-08 2021-08-30 Fine-grained scene level sketch-based image retrieval method and system

Country Status (1)

Country Link
CN (1) CN113868448A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067215A (en) * 2022-01-17 2022-02-18 东华理工大学南昌校区 Remote sensing image retrieval method based on node attention machine mapping neural network
CN114494499A (en) * 2022-01-26 2022-05-13 电子科技大学 Sketch coloring method based on attention mechanism
CN114485666A (en) * 2022-01-10 2022-05-13 北京科技大学顺德研究生院 Blind person aided navigation method and device based on object association relationship cognitive inference
CN115878833A (en) * 2023-02-20 2023-03-31 中山大学 Appearance patent image retrieval method and system based on hand-drawn sketch semantics


Similar Documents

Publication Publication Date Title
Bansal et al. Zero-shot object detection
CN108038122B (en) Trademark image retrieval method
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN113868448A (en) Fine-grained scene level sketch-based image retrieval method and system
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
Zhai et al. One-shot object affordance detection in the wild
CN108170823B (en) Hand-drawn interactive three-dimensional model retrieval method based on high-level semantic attribute understanding
Qian et al. Language-aware weak supervision for salient object detection
Yun et al. Instance GNN: a learning framework for joint symbol segmentation and recognition in online handwritten diagrams
Xu et al. Scene graph inference via multi-scale context modeling
CN110147841A (en) The fine grit classification method for being detected and being divided based on Weakly supervised and unsupervised component
Xu et al. A page object detection method based on mask R-CNN
Zhang et al. 3-D deconvolutional networks for the unsupervised representation learning of human motions
Zhu et al. 2D freehand sketch labeling using CNN and CRF
Wang et al. KTN: Knowledge transfer network for learning multiperson 2D-3D correspondences
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
Astolfi et al. Syntactic pattern recognition in computer vision: A systematic review
Wu et al. MPCT: Multiscale point cloud transformer with a residual network
Henderson Analysis of engineering drawings and raster map images
Oluwasanmi et al. Attentively conditioned generative adversarial network for semantic segmentation
CN116935100A (en) Multi-label image classification method based on feature fusion and self-attention mechanism
CN113516118B (en) Multi-mode cultural resource processing method for joint embedding of images and texts
CN115359486A (en) Method and system for determining custom information in document image
Li et al. Non-Co-Occurrence Enhanced Multi-Label Cross-Modal Hashing Retrieval Based on Graph Convolutional Network
CN112580614A (en) Hand-drawn sketch identification method based on attention mechanism

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination