CN113886615B - Hand-drawing image real-time retrieval method based on multi-granularity associative learning

Info

Publication number: CN113886615B
Application number: CN202111241283.5A
Authority: CN
Inventors: 戴大伟; 刘颖格; 唐晓宇; 夏书银; 王国胤
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2021-10-25
Filing date: 2021-10-25
Publication date: 2024-06-04
Anticipated expiration: 2041-10-25
Also published as: CN113886615A

Abstract

The invention belongs to the field of image retrieval, and particularly relates to a real-time hand-drawn image retrieval method based on multi-granularity associative learning, which comprises the following steps: training an improved deep neural network model with a triplet loss function and a multi-granularity associative learning method; extracting an embedded vector of each picture in a sketch branch with the trained model; sending the embedded vector to a discriminator to determine the level of the picture; sending the embedded vector to the dimension reduction layer corresponding to that level; calculating the Euclidean distance between the picture and each image; and returning the top-k retrieved images according to the Euclidean distance. The invention designs a multi-stage model so that the diversity of incomplete sketches does not confuse the model, and provides a multi-granularity associative learning method for progressively completed sketches, so that the embedding space of each incomplete sketch approximates the embedding space of the subsequent sketch and of the corresponding target photo, and the target photo is retrieved with as few sketch strokes as possible.

Description

Hand-drawing image real-time retrieval method based on multi-granularity associative learning

Technical Field

The invention belongs to the field of dynamic sketch retrieval, and particularly relates to a hand-drawn image real-time retrieval method based on multi-granularity associative learning.

Background

Image retrieval is classified into example-based image retrieval (EBIR) and sketch-based image retrieval (SBIR) according to the type of query picture. In SBIR, a hand-drawn sketch lacking color and texture information is used as input, and the retrieval system returns gallery images similar to the sketch. A hand-drawn sketch is an abstract human depiction of an observed object. Unlike text and labels, it can convey image information that is difficult to express in words in a more intuitive and vivid way, which effectively prevents information from being distorted during transmission. For example, when a user wants to look up a certain commodity but lacks knowledge of it and can provide neither a picture nor a text description, the user can simply draw the shape of the commodity from memory and search for it with the hand-drawn sketch. Touch devices are now developing rapidly, and the popularity of smart mobile terminals with touch screens, such as phones and tablets, gives users the means for hand-drawn and handwritten input. As a result, hand-drawn sketches are used more and more frequently to convey information in daily life, work and entertainment, and sketch-based image retrieval has attracted particular attention because of its potential commercial value.

The main advantage of sketch-based image retrieval over text/label-based retrieval is its fine granularity. This has given rise to fine-grained sketch-based image retrieval (FG-SBIR), which matches images against the details of the hand-drawn sketch with the aim of retrieving a specific photo in the gallery. Considerable progress has been made in FG-SBIR research, but two problems with sketching prevent its wide use in practice: (1) the user's insufficient drawing skill; (2) the time required to draw a complete sketch. When a reference picture is available, sketches drawn by different people for the same object differ in degree of abstraction, which leads to different sketch forms; without a reference picture, each person can only conceive and draw from subjective impression, which greatly increases the diversity of sketch forms. In addition, drawing skill and drawing style differ from person to person, which further increases the stylistic differences between sketches, causes differences in the semantic associations of sketch data, and makes sketch semantics harder to understand. Although most advanced vision systems are good at recognizing poorly drawn sketches, the time required to draw a complete sketch depends on the drawing ability of the person drawing, and the waiting time is too long if results can only be retrieved after the complete sketch has been drawn. In practical applications, retrieving the desired commodity as quickly as possible with the least stroke information is the key to real-time retrieval.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a real-time hand-drawn image retrieval method based on multi-granularity associative learning. The method provides an improved neural network model comprising three branches f1, f2 and f3, where f1 is a pre-training network, f2 is an attention layer, and f3 is a dimension reduction layer. The training set of the improved neural network model is an image set formed by a plurality of images and the complete sketches corresponding to the images. The complete sketch of each image in the image set is rendered into a plurality of pictures according to the stroke order of the drawing, and the rendered pictures form the sketch branch of that image. One image in the image set is selected as the target image in each training iteration;

After the improved neural network model has been trained, a hand-drawn image is input to retrieve images in real time. The training process of the improved neural network model comprises the following steps:

S0, training the three branches f1, f2 and f3 of the neural network model with the triplet loss function according to the hand-drawn sketches corresponding to the images in the image set, and fixing the parameters after training is finished;

S1, dividing the pictures in the sketch branch of the target image into levels according to the number of strokes required for drawing the target image, so that the diversity of incomplete sketches does not confuse the model;

S2, extracting the feature vector of the target image and the feature vector of each picture in the sketch branch through a pre-training network, and obtaining the embedded vector of the target image and the embedded vector of each picture in the sketch branch by adopting an attention mechanism of an attention layer;

S3, sending the embedded vector of the picture into a dimension reduction layer corresponding to the grade to which the picture belongs according to the grade of the picture division;

S4, after the dimension of the embedded vector of the picture is reduced in the dimension reduction layer corresponding to the grade, associating the picture with the picture in the next grade, calculating the mean square loss of the current grade and the picture in the next grade by adopting a mean square loss function MSE loss, and updating the dimension reduction layer by taking the calculated mean square loss as a loss function; this process is repeated until all levels of mean square loss computation are complete.

S5, calculating the error between each picture in the sketch branch and the images in the image set with the triplet loss function, adding it to the errors of all levels, and performing back propagation; the model parameters are adjusted with the objective of moving close to the target image and away from the images in the image set other than the target image, so that the embedded vector of each picture approaches that of the target image and the embedded vectors of two adjacent levels approach each other;

S6, acquiring a sketch branch of the next target image, and repeating the steps S1-S5 until the model reaches the upper limit of training times.

Further, the complete sketch of an image is rendered into N pictures according to the stroke order of the drawing, and the N pictures form a sketch branch. Each picture in the sketch branch comprises the first to the n-th strokes of the complete sketch, each picture contains a different number of strokes, and 1 ≤ n ≤ N. The pictures are arranged in ascending order of the number of strokes they contain, so that a sketch branch is S = {s₁, s₂, ..., s_n, ..., s_N}, where s_n represents the picture containing the first to the n-th strokes.
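The construction of a sketch branch can be illustrated with a minimal sketch. It assumes, for illustration only, that a stroke is stored as a list of (x, y) points and that rasterizing each prefix into a picture happens elsewhere; the function name build_sketch_branch is hypothetical and does not come from the patent.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Stroke = List[Point]  # assumed representation: one stroke = a polyline of points


def build_sketch_branch(strokes: Sequence[Stroke]) -> List[List[Stroke]]:
    """Return [s_1, ..., s_N], where s_n holds strokes 1..n in drawing order."""
    return [list(strokes[:n]) for n in range(1, len(strokes) + 1)]


# A three-stroke sketch yields a branch of three progressively more complete pictures.
strokes = [[(0, 0), (1, 1)], [(1, 1), (2, 0)], [(2, 0), (0, 0)]]
branch = build_sketch_branch(strokes)
```

Each entry of branch would then be rendered to an image before being fed to the network.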

Further, an attention mechanism is adopted to obtain an embedded vector of each picture in the sketch branch, and the expression is as follows:

V_H = Global_pooling(B + B·f_att(B))

Wherein B is a feature vector obtained after passing through the pre-training network, f_att(·) represents the attention mechanism, Global_pooling(·) represents global pooling of the embedded vector obtained through the attention layer, and V_H represents the embedded vector of a picture in the sketch branch obtained after global pooling.
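The expression V_H = Global_pooling(B + B·f_att(B)) can be sketched in plain Python as follows. The patent does not specify the form of f_att or the kind of pooling, so this example assumes a softmax spatial attention over the mean channel activation and global average pooling; B is represented as a list of spatial positions, each holding one value per channel. All function names are illustrative.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # assumed layout: positions x channels


def f_att(B: FeatureMap) -> List[float]:
    """Hypothetical attention: softmax over positions of the mean channel activation."""
    scores = [sum(pos) / len(pos) for pos in B]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed(B: FeatureMap) -> List[float]:
    """V_H = Global_pooling(B + B * f_att(B)), using global average pooling."""
    weights = f_att(B)
    # B + B * w is the same as B * (1 + w) at each spatial position.
    reweighted = [[x * (1.0 + w) for x in pos] for pos, w in zip(B, weights)]
    channels = len(B[0])
    return [sum(pos[c] for pos in reweighted) / len(B) for c in range(channels)]


# Four identical positions with two channels: attention is uniform (0.25 each).
B = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
V_H = embed(B)
```

The residual term B keeps the original features, so the attention branch only re-weights them rather than replacing them.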

Further, each picture of the sketch branch is classified according to the number of strokes, and each grade is designed with an independent dimension reduction layer, which is also called a linear mapping layer, and the expression of the dimension reduction layer is as follows:

V_L = A·V_H

Wherein A represents a linear mapping, and V_L represents the embedded vector of a picture in the sketch branch after dimension reduction.

Furthermore, each level is provided with a corresponding dimension reduction layer, which maps the 2048-dimensional embedded vector to 64 dimensions. The multi-granularity associative learning method makes the feature vector space of an incomplete hand-drawn image approximate that of a relatively complete hand-drawn image, thereby further optimizing the feature vector space of the incomplete hand-drawn image.
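A minimal sketch of one independent linear mapping per level, V_L = A·V_H, is given below. The 2048 and 64 dimensions come from the text; the random initialization, the choice of four levels, and the names make_linear_map and reduce_dim are assumptions made only for illustration, and no bias term is used because the expression contains none.

```python
import random
from typing import Dict, List

HIGH_DIM, LOW_DIM = 2048, 64  # dimensions stated in the text

Matrix = List[List[float]]


def make_linear_map(rows: int, cols: int, seed: int) -> Matrix:
    """Hypothetical initialization of a linear mapping A (rows x cols)."""
    rng = random.Random(seed)
    scale = 1.0 / cols ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def reduce_dim(A: Matrix, v_h: List[float]) -> List[float]:
    """V_L = A . V_H (matrix-vector product)."""
    return [sum(a * x for a, x in zip(row, v_h)) for row in A]


# One independent mapping per level; four levels are assumed for this example.
reduction_layers: Dict[int, Matrix] = {
    level: make_linear_map(LOW_DIM, HIGH_DIM, seed=level) for level in (1, 2, 3, 4)
}

v_h = [1.0] * HIGH_DIM
v_l = reduce_dim(reduction_layers[2], v_h)  # embedding of a level-2 picture
```

Keeping the mappings separate lets each level learn its own projection while the shared backbone stays fixed.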

Further, the step S1 includes:

If drawing a complete sketch requires N strokes, the sketch branch obtained after the complete sketch is rendered contains N pictures;

When the levels are divided, the 1st to the m-th pictures in the sketch branch are assigned to the first level, that is, the pictures covering the first m strokes; the (m+1)-th to the 2m-th pictures are assigned to the second level, that is, the pictures covering strokes 1 to 2m; each subsequent level adds m further pictures, that is, m further strokes;

Let P = N/m. If P is an integer, the N pictures are divided into P levels in total; if P is not an integer, the N pictures are divided into ⌊P⌋+1 levels in total.
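The equal-size division rule can be written as a short sketch. The function names num_levels and level_of are illustrative; the rule itself (m pictures per level, with one extra level when N is not a multiple of m) follows the text.

```python
def num_levels(N: int, m: int) -> int:
    """Number of levels for N pictures with m pictures per level (ceiling of N/m)."""
    if N < 1 or m < 1:
        raise ValueError("N and m must be positive")
    return (N + m - 1) // m


def level_of(n: int, m: int) -> int:
    """1-based level of the picture that contains strokes 1..n."""
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    return (n + m - 1) // m


# N = 20 strokes with m = 5 pictures per level gives four levels.
levels = [level_of(n, 5) for n in range(1, 21)]
```

With N = 22 and m = 5 the same rule gives five levels, the last of which holds only two pictures.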

Further, the step S1 includes:

If drawing a complete sketch requires N strokes, the sketch branch obtained after the complete sketch is rendered contains N pictures. Let m_k be the number of pictures contained in the k-th level; the picture levels are divided by a completeness discriminator according to a formula, the number of pictures contained in each successive level decreases, and the number of pictures contained in the k-th level is expressed as follows:

For images with fewer strokes, this reduces the number of levels to be divided, which reduces the computational load and improves retrieval efficiency.

Further, the step S4 includes:

In the process of making the i-th level approach the (i+1)-th level, the pictures in the i-th level are processed in order of increasing stroke count. For each picture x_i in the i-th level, a picture x_{i+1} is randomly selected from the (i+1)-th level and the mean square loss between the two is calculated; the losses of all pictures in the i-th level are then summed to obtain the mean square error of the i-th level, so that the i-th level approaches the (i+1)-th level. The mean square loss between a picture x_i in the i-th level and a picture x_{i+1} in the next level is expressed as:

MSE Loss = ω(x_{i+1} - x_i)²

Wherein ω >0.
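The level-to-level association loss can be sketched as follows. The text gives the form ω(x_{i+1} - x_i)² for a pair of pictures; applying it to embedding vectors by averaging the squared differences over the components is an assumption of this example, as are the function names and the seeded random generator.

```python
import random
from typing import List, Optional, Sequence

Vector = Sequence[float]


def mse_loss(x_i: Vector, x_next: Vector, omega: float = 1.0) -> float:
    """omega * mean((x_next - x_i)^2) over the embedding components; omega > 0."""
    if omega <= 0:
        raise ValueError("omega must be positive")
    return omega * sum((b - a) ** 2 for a, b in zip(x_i, x_next)) / len(x_i)


def level_association_loss(
    level_i: List[Vector],
    level_next: List[Vector],
    omega: float = 1.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Sum, over pictures in level i, of the MSE to a random picture of level i+1."""
    rng = rng or random.Random(0)
    return sum(mse_loss(x, rng.choice(level_next), omega) for x in level_i)


pair_loss = mse_loss([0.0, 0.0], [2.0, 2.0], omega=0.5)
total = level_association_loss([[0.0, 0.0], [1.0, 1.0]], [[2.0, 2.0]])
```

Because the partner from the next level is drawn at random each time, repeated passes expose each picture to different, more complete versions of the same sketch.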

Further, the expression of the triplet loss function is:

Wherein m represents the total number of pictures rendered from a complete sketch; v_[i,j] represents the embedded vector of the i-th picture in the sketch branch, obtained after passing through the dimension reduction layer; v_[i+1,rnd] represents the (i+1)-th picture of the sketch branch; v_p denotes the positive sample obtained after passing through the pre-training network and the attention layer, i.e., the embedded vector of the target photograph; v_n denotes the negative sample obtained after passing through the pre-training network and the attention layer, i.e., the embedded vector of an image other than the target image in the image set; α is a constant; and d is the Euclidean distance.
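For orientation, the standard hinge form of a triplet loss with margin α and Euclidean distance d is sketched below. This is the commonly used textbook form and is an assumption here: it is not the exact expression of the invention, which also involves m and v_[i+1,rnd]. The function names are illustrative, and the three vectors are assumed to share the same dimension.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """d(u, v): Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 alpha: float = 0.2) -> float:
    """Standard form: max(d(anchor, positive) - d(anchor, negative) + alpha, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + alpha, 0.0)


# The anchor plays the role of a sketch embedding, the positive that of v_p
# (target image) and the negative that of v_n (another image in the set).
violating = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])  # positive is farther
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # positive is closer
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin α.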

The invention develops a multi-stage model for sketch branches of different completeness so that the diversity of incomplete sketches does not confuse the model, and provides a multi-granularity associative learning method for progressively completed sketches, so that the embedding space of each incomplete sketch approximates the embedding space of the subsequent sketch and of the corresponding target photo. The target photo is thus retrieved with as few sketch strokes as possible, which shortens the retrieval time for hand-drawn sketches and improves retrieval efficiency.

Drawings

FIG. 1 is a diagram of a deep neural network backbone model of the present invention;

FIG. 2 is a diagram of a deep neural network model of the present invention;

FIG. 3 is a diagram of a multi-granularity joint learning retrieval model of the present invention.

Detailed Description

The embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.

A hand-drawn sketch real-time retrieval method based on multi-granularity associative learning is shown in figures 1-3 and comprises the following steps:

obtaining a complete sketch of a target image from the QMUL-Shoe-V2 dataset and the QMUL-Chair-V2 dataset, and rendering the complete sketch into N pictures according to the stroke order of the drawing, wherein the N pictures form a sketch branch, each picture in the sketch branch comprises the first to the n-th strokes of the complete sketch, each picture contains a different number of strokes, 1 ≤ n ≤ N, and the pictures are arranged in ascending order of the number of strokes they contain, so that a sketch branch is S = {s₁, s₂, ..., s_n, ..., s_N}, where s_n represents the picture comprising the first to the n-th strokes;

specifically, the providers of the QMUL-Shoe-V2 dataset and the QMUL-Chair-V2 dataset recruited volunteers with different levels of drawing skill and had them hand-draw a complete sketch from each target image.

Specifically, as shown in fig. 3, a complete sketch is rendered into N pictures according to the completeness of the sketch; the N pictures form a sketch branch, and each picture in the sketch branch includes the first stroke to the n-th stroke of the complete sketch. For example, the first picture in the sketch branch contains only the first stroke of the complete sketch, the second picture contains the first and second strokes, the third picture contains the first, second and third strokes, and so on.

The method comprises the steps of forming an image set by a plurality of images and the complete sketch corresponding to the images, obtaining the complete sketch of each image in the image set, rendering the complete sketch of each image into a plurality of pictures according to the stroke sequence of drawing, forming a sketch branch of one image by the plurality of pictures, finishing the rendering process of the complete sketch of all the images before training, and selecting one image in the image set as a target image in each training.

S0, training the three branches f1, f2 and f3 of the neural network model with the triplet loss function according to the hand-drawn sketches corresponding to the images in the image set, and fixing the parameters after training is finished; as shown in FIG. 1, the positive sample is the target image, the sketch is the hand-drawn complete sketch corresponding to the target image, and the negative sample is an image other than the target image in the image set;

S1, classifying each picture in a sketch branch of a target image according to the number of strokes required for drawing the target image;

each level is provided with a corresponding dimension reduction layer, which maps the 2048-dimensional embedded vector to 64 dimensions; the multi-granularity associative learning method makes the feature vector space of an incomplete hand-drawn image approximate that of a relatively complete hand-drawn image, thereby further optimizing the feature vector space of the incomplete hand-drawn image.

S4, after the dimension of the embedded vector of the picture is reduced in the dimension reduction layer corresponding to the grade, associating the picture with the picture in the next grade, calculating the mean square loss of the current grade and the picture in the next grade by adopting a mean square loss function MSE loss, and updating the dimension reduction layer by taking the calculated mean square loss as a loss function; repeating the process until the mean square loss calculation of all the levels is completed;

MSE Loss = ω(x_{i+1} - x_i)²

Wherein ω >0.

S5, calculating the error between each picture in the sketch branch and the target image with the triplet loss function, adding it to the errors of all levels, and performing back propagation; the model parameters are adjusted with the objective of moving close to the target image and away from the images in the image set other than the target image, so that the embedded vector of each picture approaches that of the target image and the embedded vectors of two adjacent levels approach each other;

In one embodiment, as shown in fig. 3, the level division in step S1 uses an equal stroke count per level, wherein adjacent pictures share strokes and adjacent levels also share strokes, as follows:

Specifically, taking a complete sketch that requires 20 strokes as an example, the sketch branch obtained after rendering contains 20 pictures. The 1st to 5th pictures in the sketch branch are assigned to the first level (covering the first five strokes), the 6th to 10th pictures to the second level (covering strokes 1 to 10), the 11th to 15th pictures to the third level (covering strokes 1 to 15), and the 16th to 20th pictures to the fourth level (covering strokes 1 to 20).

In another embodiment, step S1 includes:

If drawing a complete sketch requires N strokes, the sketch branch obtained after the complete sketch is rendered contains N pictures. Let m_k be the number of pictures contained in the k-th level; the picture levels are divided by a completeness discriminator according to a formula, and the number of pictures contained in each successive level decreases. For images with fewer strokes, this method reduces the number of levels to be divided, which reduces the computational load and improves retrieval efficiency.

Preferably, an attention mechanism is adopted to obtain an embedded vector of each picture in the sketch branch, and the expression is:

V_H = Global_pooling(B + B·f_att(B))

Preferably, each picture in the sketch branch is classified according to the stroke number, and each class is designed with a separate dimension reduction layer, which is also called a linear mapping layer, and the expression of the dimension reduction layer is as follows:

V_L = A·V_H

Preferably, the expression of the triplet loss function is:

Wherein m represents the total number of pictures rendered from a complete sketch; v_[i,j] represents the embedded vector of the i-th picture in the sketch branch, obtained after passing through the dimension reduction layer; v_[i+1,rnd] represents the (i+1)-th picture in the sketch branch; v_p denotes the positive sample obtained after passing through the pre-training network and the attention layer, i.e., the embedded vector of the target image; v_n denotes the negative sample obtained after passing through the pre-training network and the attention layer, i.e., the embedded vector of an image other than the target image in the image set; α is a constant; and d is the Euclidean distance.

The process of real-time retrieval of hand-drawn images comprises the following steps:

S21, taking the sketch drawn by the user on a drawing board as the original sketch, and forming a picture each time a stroke is added, following the drawing stroke order;

S22, sending the picture of the current stroke into the pre-training network and the attention layer to obtain the embedded vector of the picture of the current stroke;

S23, sending the picture of the current stroke to the completeness discriminator to determine the level of the picture, and sending its embedded vector to the dimension reduction layer corresponding to that level;

S24, after the dimension of the embedded vector of the picture is reduced by the dimension reduction layer corresponding to the level, calculating the Euclidean distance between the picture and the images in the database;

S25, returning the top-k retrieved images according to the Euclidean distance between the picture and the images in the database;

S26, acquiring the next stroke drawn by the user, and repeating the steps S22-S25 until the target picture is retrieved or all strokes have been processed.
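The ranking performed in steps S24 and S25 can be sketched as a nearest-neighbour search by Euclidean distance. The example assumes the gallery embeddings are precomputed and held in a dictionary keyed by image name; the names top_k and gallery, the toy two-dimensional vectors, and the exhaustive search are illustrative choices rather than details from the patent.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def top_k(query: Vector, gallery: Dict[str, Vector], k: int) -> List[str]:
    """Return the names of the k gallery images closest to the query embedding."""
    ranked = sorted(gallery.items(), key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Toy embeddings standing in for the reduced vectors of database images.
gallery = {
    "shoe_a": [0.0, 0.0],
    "shoe_b": [1.0, 1.0],
    "chair_a": [5.0, 5.0],
}
query = [0.9, 0.9]  # embedding of the picture after the current stroke
results = top_k(query, gallery, k=2)
```

In the real-time loop this ranking would be recomputed after every new stroke, with the query embedding taken from the dimension reduction layer of the picture's level.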

As shown in fig. 2, the network structure of the deep neural network model is divided into two parts. One part trains on the image set containing the target image with the triplet loss, and the images obtain the required embedded vectors through the three branches f₁, f₂ and f₃ of the model. The other part trains on the sketch branch with the triplet loss and the MSE loss: after a picture of the sketch branch passes through f₁ and f₂, its level is determined and it is sent to the corresponding layer among f₃₁, f₃₂, ..., f₃ₖ to obtain the required embedded vector.

When no commodity picture is available and the commodity is difficult to describe in text, a user can draw a sketch of the commodity by hand on a touch screen device from their impression of it. The sketch is rendered into a sketch branch and input into the trained neural network model, and the model returns the k images most similar to the sketch through retrieval over the sketch branch.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A hand-drawn image real-time retrieval method based on multi-granularity associative learning, characterized in that the method provides an improved neural network model, the improved neural network model comprises three branches f₁, f₂ and f₃, f₁ is a pre-training network, f₂ is an attention layer, and f₃ is a dimension reduction layer; a training set of the improved neural network model is an image set composed of a plurality of images and complete sketches corresponding to the images; the complete sketch of each image in the image set is rendered into a plurality of pictures according to the stroke order of the drawing, the sketch branch set of the image set is constructed from the rendered pictures, and one image in the image set is selected as a target image in each training;

S0, training the three branches f₁, f₂ and f₃ of the neural network model with a triplet loss function according to the hand-drawn sketches corresponding to the images in the image set, and fixing the parameters after training is finished;

S4, after the dimension of the embedded vector of the picture is reduced in the dimension reduction layer corresponding to the grade, associating the picture with the picture in the next grade, calculating the mean square loss of the picture in the current grade and the picture in the next grade by adopting a mean square loss function, and updating the dimension reduction layer by taking the calculated mean square loss as a loss function; repeating the process until the mean square loss calculation of all the levels is completed;

S5, calculating the error between each picture in the sketch branch and the images in the image set with the triplet loss function, adding it to the errors of all levels, and performing back propagation; the model parameters are adjusted with the objective of moving close to the target image and away from the images in the image set other than the target image, so that the embedded vector of each picture approaches that of the target image and the embedded vectors of two adjacent levels approach each other;

2. The method for searching the hand-drawn image in real time based on multi-granularity associative learning according to claim 1, wherein the complete sketch of one image is rendered into N pictures according to the stroke order of the drawing, the N pictures form a sketch branch, each picture in the sketch branch comprises the first to the n-th strokes of the complete sketch, each picture contains a different number of strokes, 1 ≤ n ≤ N, the pictures are arranged in ascending order of the number of strokes they contain, and one sketch branch is S = {s₁, s₂, ..., s_n, ..., s_N}, where s_n represents the picture containing the first to the n-th strokes.

3. The method for searching the hand-drawn images in real time based on multi-granularity associative learning according to claim 1, wherein an attention mechanism is adopted to obtain an embedded vector of each picture in a sketch branch, and the expression is as follows:

V_H = Global_pooling(B + B·f_att(B))

4. A method for searching hand-drawn images based on multi-granularity associative learning in real time according to claim 1 or 3, wherein each picture in the sketch branches is classified according to the stroke number, and each class is designed with a separate dimension-reducing layer, which is also called a linear mapping layer, and the expression:

V_L = A·V_H

5. The method for real-time searching of hand-drawn images based on multi-granularity associative learning according to claim 4, wherein each level is provided with a corresponding dimension reduction layer, the dimension reduction layer maps 2048-dimensional embedded vectors to 64-dimensional, and the multi-granularity associative learning method is adopted to achieve the approximation of the feature vector space of the incomplete hand-drawn image to the feature vector space of the relatively complete hand-drawn image so as to further optimize the feature vector space of the incomplete hand-drawn image.

6. The method for real-time searching of hand-drawn images based on multi-granularity associative learning according to claim 1, wherein the step S1 comprises:

7. The method for real-time searching of hand-drawn images based on multi-granularity associative learning according to claim 1, wherein the step S1 comprises:

8. The method for real-time searching of hand-drawn images based on multi-granularity associative learning according to claim 1, wherein the step S4 comprises:

MSE Loss = ω(x_{i+1} - x_i)²

Wherein ω >0.

9. The method for searching hand-drawn images in real time based on multi-granularity associative learning according to claim 1 or 2, wherein the expression of the triplet loss function is:

wherein m represents the total number of pictures rendered from a complete sketch; v_[i,j] represents the embedded vector of the i-th picture in the sketch branch, obtained after passing through the dimension reduction layer; v_p denotes the positive sample obtained after passing through the pre-training network and the attention layer, i.e., the embedded vector of the target image; v_n denotes the negative sample obtained after passing through the pre-training network and the attention layer, i.e., the embedded vector of an image other than the target image in the image set; α is a constant; and d is the Euclidean distance.