CN110634170B - Photo-level image generation method based on semantic content and rapid image retrieval - Google Patents

Photo-level image generation method based on semantic content and rapid image retrieval

Info

Publication number
CN110634170B
Authority
CN
China
Prior art keywords
training
picture
network
photo
semantic
Prior art date
Legal status
Active
Application number
CN201910813199.2A
Other languages
Chinese (zh)
Other versions
CN110634170A (en)
Inventor
薛雨阳
浦佳祺
薛裕明
李�根
童同
高钦泉
Current Assignee
Fujian Imperial Vision Information Technology Co ltd
Original Assignee
Fujian Imperial Vision Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Fujian Imperial Vision Information Technology Co ltd filed Critical Fujian Imperial Vision Information Technology Co ltd
Priority to CN201910813199.2A priority Critical patent/CN110634170B/en
Publication of CN110634170A publication Critical patent/CN110634170A/en
Application granted granted Critical
Publication of CN110634170B publication Critical patent/CN110634170B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a photo-level image generation method based on semantic content and rapid image retrieval, comprising a background generation part and a foreground generation part. The background generation part uses a generative adversarial network so that the generated background picture is vivid and close to a real scene. The foreground generation part uses computer vision to split the doodle data into blocks, inputs the blocks into the recognition network of a depth model for doodle recognition, indexes the corresponding picture database with the labels obtained from the recognition network's feedback, retrieves the picture with the highest similarity using a nearest neighbor model, and fuses it into the corresponding position of the background. The foreground is thus generated by image retrieval, and the foreground and background are finally superimposed, making the generated picture more realistic and complete.

Description

Photo-level image generation method based on semantic content and rapid image retrieval
Technical Field
The invention relates to the field of image generation and image retrieval, in particular to a method for generating a live-action picture based on a semantic graph and image retrieval.
Background
Image generation based on deep learning is a relatively new artificial intelligence technology: given simple user interaction and prior knowledge of images, corresponding pictures are generated automatically according to the composition drawn by the user. With the rise of generative adversarial networks, image and video content generation has reached ordinary users; it can now be experienced not only on computers but also on mobile devices such as mobile phones and tablets. However, constrained by model complexity, the lack of training data and other factors, the generated picture is often unsatisfactory and cannot reach the quality of a real photograph. Furthermore, each user has a different understanding of beauty, and generated pictures often fall short aesthetically. Therefore, to make the quality of the generated image as high as possible and meet user expectations while improving the user experience, a better photo-level image generation method is needed.
Graphics experts and computer scientists have long studied the image generation problem. The biggest difficulty of generative models is that the machine struggles to understand the input data and, on that basis, to generate results that follow the same distribution as the input. Moreover, user input differs from the training data, sometimes lying far from its distribution, so it is hard for the trained model to generalize well.
With Goodfellow's proposal of the generative adversarial network (GAN) [1], the picture generation problem advanced significantly beyond traditional methods. A generative adversarial network consists of two neural networks, a generator and a discriminator; through adversarial learning the two improve each other day by day until the output approaches real data. In recent years a great many GAN-based methods have been proposed, including the most basic DCGAN [2], Conditional GAN [3] and Pix2Pix [4], as well as more complex models such as CycleGAN [5]. Although GAN-based research has achieved good results, these methods are basically applied to learning the mapping between the image to be converted and a reference image, so that the generated image becomes closer to a real image and more vivid.
In summary, GAN-based solutions to the image generation problem can be grouped into three categories: direct, hierarchical and iterative. The direct method is the most intuitive: a single generator and a single discriminator form the model, with no other branches. Early GAN models such as the GAN and DCGAN mentioned above basically follow this approach; their generators consist essentially of convolutional layers, batch normalization layers and ReLU activation functions. Unlike the direct method, the hierarchical method employs more than one pair of generators and discriminators to enhance the generation effect; the idea is to treat a picture as different parts, such as "texture" and "style", or "foreground" and "background". SS-GAN [6], for example, divides its network into two pairs of adversarial generation networks: Structure-GAN is responsible for generating a better body structure, and Style-GAN is responsible for generating the picture style. The iterative method is more particular: its model contains several pairs of generators and discriminators with identical or similar structures, and the generated images are refined through iteration. Progressive-GAN [7] first trains one pair of generator and discriminator at 4×4 resolution and then grows step by step to 1024×1024, until a high-definition face image is generated.
Disclosure of Invention
The invention aims to provide a photo-level image generation method based on semantic content and rapid image retrieval, which uses a generative adversarial network so that the generated background picture is vivid and close to a real scene, generates the foreground by image retrieval, and finally superimposes the foreground and the background to obtain a result that meets the user's requirements and aesthetic expectations.
The technical scheme adopted by the invention is as follows:
A photo-level image generation method based on semantic content and rapid image retrieval comprises a background generation part and a foreground generation part, and specifically comprises the following steps:
s1, the background generation part, including the steps of:
S1.1, acquiring a training data set for training the background picture generation model: selecting a large number of color pictures as targets I_G, labeling them to determine the scene category and obtain a semantic segmentation map I_S for each color picture, obtaining augmented picture data by mirroring and cropping, and storing the data as matched pairs to serve as the training data set for deep learning;
S1.2, inputting the color pictures of the training data set into an encoder network to perform the feature extraction stage and reconstruct the Gaussian distribution of the corresponding style;
S1.3, using the Gaussian distribution reconstructed by the encoder network as the input of a generator network, with the inserted semantic map as auxiliary input, to obtain and output enhanced semantic-map features;
S1.4, inputting the concatenation of the one-hot semantic map and the output of the generator network into the discriminator network to judge how real the generated background is, the discriminator network being expressed as:
I′_O = ReLU((ReLU(W_d1 * concat(I_s, F′) + B_d1) × W_d2 + B_d2) … × W_dn + B_dn)   (6)
wherein W_d1, W_d2, W_dn, B_d1, B_d2 and B_dn respectively represent the weight and bias parameters of the first, second and nth convolutional layers, n is the number of convolutional layers, F′ is the output of the generator, and I_s represents the output of the deconvolution stage;
s1.5, calculating and obtaining a loss function, specifically:
S1.5.1, comparing the generated image obtained by the generator with the corresponding original color image and calculating the Perceptual Loss, expressed as:
L_perc = (1 / (C_j H_j W_j)) · || φ_j(F′) − φ_j(I_G) ||_2^2
where j denotes the j-th layer of the network, C_j H_j W_j is the size of the feature map of the j-th layer, and φ represents the loss network;
S1.5.2, meanwhile using a HingeLoss-based loss function as the optimization loss of the GAN, the loss function being:
L_D = −E_(x,y)[min(0, −1 + D(x, y))] − E_(x,z)[min(0, −1 − D(x, G(x, z)))]
L_G = −E_(x,z)[D(x, G(x, z))]
wherein D represents the discriminator, G represents the generator, z is a hidden variable, x represents an input, and y is a target;
S1.5.3, performing point-to-point loss calculation using MSE loss function;
S1.6, training the background picture generation model with a progressive training strategy: the training process is divided into several preset sub-training periods, which are trained in sequence with a step-growth strategy; at the start of training the original pictures are scaled down to small pictures and training begins with a large learning rate, and after each sub-training period the original color pictures are gradually enlarged and the learning rate is gradually reduced;
if the color image generated after a sub-training period, compared with the corresponding original color image, does not reach the preset reconstruction effect, back-propagation continues, the convolution weight and bias parameters are updated with a gradient descent optimization algorithm, and S1.2 is executed again; when the color images generated after a sub-training period meet expectations, or all preset sub-training periods are completed, the final result is obtained;
s2, the foreground generating part comprises the following steps:
S2.1, preparing the foreground generation data set: selecting a large doodle data set as training data for training the doodle recognition model;
mirroring and cropping the original doodle images to obtain a large amount of augmented picture data, and storing the doodle data in the corresponding folders according to their labels;
S2.2, splitting the doodle data into blocks with computer vision: applying erosion and dilation to the doodle pictures, counting the blocks of the corresponding objects in each picture via connected components, recording the center point of each block, and scaling each block to a fixed size;
S2.3, inputting the block pictures into the recognition network of the depth model for doodle recognition, indexing the corresponding picture database with the labels obtained from the recognition network's feedback, retrieving the picture with the highest similarity using a nearest neighbor model, and fusing it into the corresponding position of the background.
Further, the initial size of the color pictures in S1.1 is 3 × 256 × 256, corresponding to the color channels, picture width and picture height respectively; the initial size of the semantic segmentation map is 1 × 256 × 256, with a single channel storing the label information.
Further, the feature extraction stage in S1.2 includes four large convolutional layers, each large convolutional layer includes a convolutional layer, a batch regularization layer, and a ReLU activation function, and a calculation formula of each large convolutional layer is:
F = ReLU{BN{W_1 × I_g + B_1}}   (1)
wherein ReLU is the non-linear activation function, W_1 and B_1 respectively represent the weight and bias of the convolutional layer in the feature extraction stage, BN represents the batch normalization function, I_g represents the input picture, and F is the output obtained in the feature extraction stage.
Further, the output of the feature extraction stage in S1.2 is passed through a reshaping module and then two fully connected layers, which respectively produce the mean and variance of the Gaussian distribution.
Further, the loss function used by the encoder network of S1.2 is KL Divergence, which is expressed by the following formula:
L_KLD = D_KL(q(z|x) || p(z))   (2)
wherein p and q are both standard Gaussian distributions.
Further, the generator in S1.3 comprises two or more custom regularized residual unit blocks; each regularized residual unit contains a custom regularization block. Random Gaussian noise is added at the input of the custom regularization block; the auxiliary inserted semantic map is one-hot encoded into a matrix whose dimension equals the number of semantic labels, resized to the same size as the input, and passed through a convolution kernel and a ReLU activation function; finally, two convolution kernels respectively produce the weight and variance corresponding to the semantic map, which are combined with the noise-augmented input to give the final result.
Further, the formula of the custom regular block in S1.3 is expressed as:
F_s = ReLU(Resize(I_s) × W_1 + b_1)   (2)
W_s = W_2 × F_s + b_2   (3)
b_s = W_3 × F_s + b_3   (4)
F′ = concat(I_g, Noise) × W_s + b_s   (5)
wherein W_1, b_1 represent the weight and bias of the first convolutional layer, W_2, b_2 the weight and bias of the convolution that computes the weight, W_3, b_3 the weight and bias of the convolution that computes the bias, F_s the intermediate result, and F′ the output of the custom regularization block.
Further, the regularized residual unit in S1.3 comprises two convolutional layers and a jump connection; the output of the generator network is connected with the output of the feature extraction layer through the jump connection.
further, the scribble data in S2.1 is divided into 30 categories.
Further, the doodle recognition model in S2.3 uses an inverted residual structure.
With the above technical scheme, the adversarial generation network makes the generated background picture vivid and close to a real scene; the foreground is generated by image retrieval, and the foreground and background are finally superimposed to obtain a result that meets the user's requirements and aesthetic expectations.
Drawings
The invention is described in further detail below with reference to the drawings and the specific embodiments.
FIG. 1 is a schematic diagram of an encoder network configuration of the present invention;
FIG. 2 is a schematic diagram of a network structure of a custom regular block of the present invention;
FIG. 3 is a schematic diagram of a network structure of a regularized residual unit of the present invention;
FIG. 4 is a schematic diagram of a generator network architecture of the present invention;
FIG. 5 is a schematic diagram of a discriminator network according to the present invention;
FIG. 6 is a general architectural diagram of the background generation portion of the present invention;
FIG. 7 is a comparison of the background generation portion of the present invention;
FIG. 8 is a diagram of the effect of the background fusion foreground of the present invention.
Detailed Description
As shown in Figs. 1 to 8, the invention discloses a photo-level image generation method based on semantic content and rapid image retrieval, which includes a background generation part and a foreground generation part; the specific steps are as follows:
S1, the background generation part, whose overall architecture is shown in Fig. 6, comprises the following steps:
S1.1, selecting a large number of color pictures as targets I_G for training the picture generation model, and labeling them to determine the scene category, which yields a semantic segmentation map I_S for each color picture. A large amount of augmented picture data is obtained by mirroring and cropping the original color pictures and their semantic segmentation maps, and stored as matched data pairs to serve as the training data set for deep learning. The initial size of each color picture is 3 × 256 × 256, corresponding to the color channels, picture width and picture height; the semantic picture has an initial size of 1 × 256 × 256, with a single channel storing the label information.
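For illustration, a minimal Python sketch of the paired augmentation described above is given below; it assumes PIL and NumPy are available, and the crop size and file-path arguments are illustrative choices, not taken from the patent.

```python
import random
import numpy as np
from PIL import Image

def augment_pair(color_path, semantic_path, crop=224):
    """Produce one mirrored/cropped (color, semantic) training pair.

    The same flip and crop window are applied to both images so that the
    semantic labels stay aligned with the RGB pixels.
    """
    color = Image.open(color_path).convert("RGB").resize((256, 256))
    semantic = Image.open(semantic_path).resize((256, 256), Image.NEAREST)

    # Random horizontal mirror, applied identically to both images.
    if random.random() < 0.5:
        color = color.transpose(Image.FLIP_LEFT_RIGHT)
        semantic = semantic.transpose(Image.FLIP_LEFT_RIGHT)

    # Random crop with a shared window, then scale back to 256 x 256.
    left = random.randint(0, 256 - crop)
    top = random.randint(0, 256 - crop)
    box = (left, top, left + crop, top + crop)
    color, semantic = color.crop(box), semantic.crop(box)

    # Color target I_G: 3 x 256 x 256 floats; semantic map I_S: 1 x 256 x 256 labels.
    color = np.asarray(color.resize((256, 256)), np.float32).transpose(2, 0, 1) / 255.0
    semantic = np.asarray(semantic.resize((256, 256), Image.NEAREST), np.int64)[None]
    return color, semantic
```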
S1.2, encoder section. And inputting the color images in the training data set into an encoder network to execute a characteristic extraction stage, and finally reconstructing the corresponding style distribution. The details of the steps are as follows:
the feature extraction stage consists of four large convolution layers, including a convolution layer of 3 × 3 convolution kernels, a batch regularization layer and a ReLU activation function, wherein the calculation formula of one large convolution block is as follows:
F = ReLU{BN{W_1 × I_g + B_1}}   (1)
wherein ReLU is the non-linear activation function, W_1 and B_1 respectively represent the weights and biases of the convolutional layer in the feature extraction stage, BN represents the batch normalization function, I_g is the input picture, and F is the output of the feature extraction stage. After the four large convolution blocks, the features pass through a reshaping module (Reshape) and then two fully connected layers, which respectively produce the mean and variance of a Gaussian distribution; the mean and variance here are taken to represent the Gaussian distribution of the style. The encoder part is shown in Fig. 1. The loss function used by the encoder part is the KL Divergence:
L_KLD = D_KL(q(z|x) || p(z))   (2)
wherein p and q are both standard Gaussian distributions.
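A minimal PyTorch sketch of such an encoder and its KL loss follows; the channel widths, latent dimension and stride choices are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Sketch of the encoder of Fig. 1: four large conv blocks (conv + BN + ReLU,
    formula (1)), a reshape, then two fully connected layers giving the mean and
    (log-)variance of the style Gaussian."""

    def __init__(self, z_dim=256):
        super().__init__()
        chans = [3, 64, 128, 256, 512]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                       nn.BatchNorm2d(cout),
                       nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*blocks)           # 256 -> 16 spatially
        self.fc_mu = nn.Linear(512 * 16 * 16, z_dim)
        self.fc_logvar = nn.Linear(512 * 16 * 16, z_dim)

    def forward(self, img):
        f = self.features(img).flatten(1)                # reshaping module
        return self.fc_mu(f), self.fc_logvar(f)

def kld_loss(mu, logvar):
    """KL divergence (2) between q(z|x) = N(mu, sigma^2) and a standard normal."""
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
```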
S1.3, generator part. The generator part is the most important part of the whole generation network, and aims to take the random distribution obtained by the encoder as input so as to continuously assist the inserted semantic graph to enhance the semantic graph characteristics. The generator mainly comprises a self-defined regularization residual error unit block; each regularization residual unit also comprises a self-defined regularization block. In each custom regular block, random gaussian noise is added to the input of the regular block in order to generate richer textures. Meanwhile, the auxiliary inserted semantic graph is subjected to one-hot operation to obtain a matrix with the same dimensionality as the semantic label quantity, Resize is conducted to the matrix with the same size as the input, then a convolution kernel of 3 x 3 and a ReLU activation function are conducted, finally, the weight and the variance corresponding to the semantic graph are obtained through two convolution kernels respectively, and the final result is obtained after operation is conducted on the semantic graph and the inserted noise input. The custom regular block is shown in fig. 2, and its formula is:
F_s = ReLU(Resize(I_s) × W_1 + b_1)   (2)
W_s = W_2 × F_s + b_2   (3)
b_s = W_3 × F_s + b_3   (4)
F′ = concat(I_g, Noise) × W_s + b_s   (5)
wherein W_1, b_1 represent the weight and bias of the first convolutional layer, W_2, b_2 the weight and bias of the convolution that computes the weight, W_3, b_3 the weight and bias of the convolution that computes the bias, F_s the intermediate result, and F′ the output of the custom regularization block. The custom regularization block is similar to batch normalization, but whereas batch normalization operates on channels, which is not enough to reconstruct the whole picture, the custom regularization block performs the reconstruction at the pixel level, so the generated image can be more accurate and fine.
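A possible PyTorch sketch of such a custom regularization block, in the spirit of equations (2)-(5), is given below; channel widths are assumptions, and the noise is added rather than concatenated as a simplification of equation (5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticRegBlock(nn.Module):
    """Sketch of the custom regularization block of Fig. 2: the one-hot semantic
    map is resized to the feature resolution, passed through a shared 3x3 conv +
    ReLU (F_s), and two further convs predict a per-pixel weight W_s and bias b_s
    that modulate the noisy input features (F')."""

    def __init__(self, feat_channels, num_labels, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(num_labels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_weight = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_bias = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, semantic_onehot):
        # Random Gaussian noise at the block input for richer texture
        # (equation (5) concatenates the noise; addition is used here for brevity).
        feat = feat + torch.randn_like(feat)
        seg = F.interpolate(semantic_onehot, size=feat.shape[2:], mode="nearest")
        shared = self.shared(seg)          # F_s, equation (2)
        w = self.to_weight(shared)         # W_s, equation (3)
        b = self.to_bias(shared)           # b_s, equation (4)
        return feat * w + b                # F',  equation (5)
```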
The regularized residual unit, shown in Fig. 3, contains several custom regularization blocks. It is composed of two convolutional layers and a jump connection; the output of this layer is connected to the output of the feature extraction layer through the jump connection, which avoids gradient vanishing and strengthens the information by preserving the original features. Unlike an ordinary residual unit, the batch normalization layer is replaced by the custom regularization block. The complete background generator network is shown in Fig. 4.
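Continuing the previous sketch, a regularized residual unit could be assembled as follows; the layer configuration is again an illustrative assumption rather than the patent's exact design.

```python
import torch.nn as nn
import torch.nn.functional as F
# Reuses SemanticRegBlock from the sketch above.

class RegResidualUnit(nn.Module):
    """Sketch of the regularized residual unit of Fig. 3: two 3x3 convolutions,
    each preceded by the custom regularization block instead of batch norm,
    plus a jump (skip) connection that preserves the incoming features."""

    def __init__(self, channels, num_labels):
        super().__init__()
        self.reg1 = SemanticRegBlock(channels, num_labels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.reg2 = SemanticRegBlock(channels, num_labels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, semantic_onehot):
        out = self.conv1(F.relu(self.reg1(feat, semantic_onehot)))
        out = self.conv2(F.relu(self.reg2(out, semantic_onehot)))
        return out + feat    # jump connection
```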
S1.4, a discriminator section. As shown in fig. 5, in order to determine the authenticity of background generation, a discriminator model of Pix2PixHD is used for reference, and an improvement is made on the basis of the discriminator model. The input of the network is a semantic graph in a one-hot form and a generator generates a jointed matrix. The network is mainly based on a volume sum and a ReLU activation function, and the formula is as follows:
I′_O = ReLU((ReLU(W_d1 * concat(I_s, F′) + B_d1) × W_d2 + B_d2) … × W_dn + B_dn)   (6)
wherein W_d1, W_d2, W_dn, B_d1, B_d2 and B_dn respectively represent the weight and bias parameters of the first, second and nth convolutional layers, F′ is the output of the generator, and I_s represents the output of the deconvolution stage. Here, n is set to 4.
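A minimal PyTorch sketch of such a discriminator with n = 4 convolution + ReLU stages follows; channel widths, strides and the final scoring convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BackgroundDiscriminator(nn.Module):
    """Sketch of the discriminator of Fig. 5 / formula (6): the one-hot semantic
    map and the generator output are concatenated and passed through n = 4
    convolution + ReLU stages, ending in a patch-level real/fake score."""

    def __init__(self, num_labels, base=64, n_layers=4):
        super().__init__()
        layers, cin = [], num_labels + 3           # semantic channels + RGB
        for i in range(n_layers):
            cout = base * (2 ** i)
            layers += [nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            cin = cout
        layers += [nn.Conv2d(cin, 1, 3, padding=1)]  # per-patch realism score
        self.net = nn.Sequential(*layers)

    def forward(self, semantic_onehot, generated):
        return self.net(torch.cat([semantic_onehot, generated], dim=1))
```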
S1.5, calculating a loss function. And (3) comparing the generated image obtained by the generator in the step (3) with the original color image corresponding to the step (1) and calculating a Perceptual Loss penalty function. The loss function can be expressed as:
L_perc = (1 / (C_j H_j W_j)) · || φ_j(F′) − φ_j(I_G) ||_2^2
where j denotes the j-th layer of the network and C_j H_j W_j is the size of the feature map of the j-th layer. The loss network is a VGG16 network trained on ImageNet, denoted φ.
Meanwhile, a HingeLoss-based loss function is used as the optimized loss of the GAN, and the loss function is as follows:
L_D = −E_(x,y)[min(0, −1 + D(x, y))] − E_(x,z)[min(0, −1 − D(x, G(x, z)))]
L_G = −E_(x,z)[D(x, G(x, z))]
furthermore, a simpler MSE loss function is used to perform point-to-point loss calculation.
S1.6, training the picture generation model with a progressive training strategy. The training process is divided into several preset sub-training periods, which are trained in sequence with a step-growth strategy; at the start of training the original pictures are scaled down to small pictures and training begins with a large learning rate, and after each sub-training period the original color pictures are gradually enlarged and the learning rate is gradually reduced.
If the color image generated after a sub-training period, compared with the corresponding original color image, does not reach the preset reconstruction effect, back-propagation continues, the convolution weight and bias parameters are updated with a gradient descent optimization algorithm, and S1.2 is executed again; when the color images generated after a sub-training period meet expectations, or all preset sub-training periods are completed, the final result is obtained. The idea is to begin training on the original pictures scaled down to small pictures, assisted by a large learning rate; after each training period the input pictures are enlarged and the learning rate is reduced. In this way the precision of higher-resolution pictures is built on top of the low-resolution pictures, reducing the distortion and unreasonable color effects produced by the convolutional generation network.
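An illustrative sketch of this progressive schedule is given below; the resolutions, learning rates, and the names optimizer, train_loader and train_step are assumptions standing in for the surrounding training code, not values from the patent.

```python
import torch.nn.functional as F

# Illustrative progressive schedule: each sub-training period enlarges the
# pictures and lowers the learning rate (values are assumptions).
schedule = [(64, 2e-4), (128, 1e-4), (192, 5e-5), (256, 2e-5)]

# `optimizer`, `train_loader` and `train_step` are assumed to be defined by the
# surrounding training code (model, losses and data pipeline).
for resolution, lr in schedule:                        # one sub-training period
    for group in optimizer.param_groups:
        group["lr"] = lr                               # step down the learning rate
    for color, semantic in train_loader:
        small_color = F.interpolate(color, size=(resolution, resolution),
                                    mode="bilinear", align_corners=False)
        small_semantic = F.interpolate(semantic.float(),
                                       size=(resolution, resolution),
                                       mode="nearest")
        loss = train_step(small_color, small_semantic)  # forward pass + losses
        optimizer.zero_grad()
        loss.backward()                                 # back-propagation
        optimizer.step()                                # gradient-descent update
```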
As shown in Fig. 7, the background generated by the method of the present invention is compared with backgrounds generated by other prior-art methods.
S2, the foreground generating part comprises the following steps:
S2.1, preparing the foreground generation data set. To train the doodle recognition model, a large doodle data set is selected as training data; the publicly available Sketchy Dataset [8] is adopted. A large amount of augmented picture data is obtained by mirroring and cropping the original doodle images, and the doodle data are stored into the corresponding folders according to their labels. The initial size of an original doodle picture is 1 × 256 × 256, corresponding to the color channel, picture width and picture height. The doodle data are divided into 30 categories in the subsequent classification.
S2.2, blocking the doodle data with computer vision. The input image is recognized with computer-vision methods: since one input image may contain several objects, it must be split into blocks before being fed to the network. The blocking operation applies erosion and dilation to the picture, with an erosion iteration count of 5 and a dilation iteration count of 3. The number of blocks in the picture is then counted via connected components, the center point of each block is recorded, and each block is scaled to 1 × 256 × 256. Finally, the processed pictures are input into the depth model for recognition.
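A minimal OpenCV sketch of this blocking step follows; the binarization threshold and kernel size are assumptions, while the erosion/dilation iteration counts follow the values stated above.

```python
import cv2
import numpy as np

def split_doodle_into_blocks(doodle_gray):
    """Erode, dilate, find connected components, record each block's center and
    rescale each block to a fixed 256 x 256 patch."""
    kernel = np.ones((3, 3), np.uint8)
    # Binarize the doodle so that strokes become foreground (assumed threshold).
    _, binary = cv2.threshold(doodle_gray, 127, 255, cv2.THRESH_BINARY_INV)
    processed = cv2.erode(binary, kernel, iterations=5)
    processed = cv2.dilate(processed, kernel, iterations=3)

    num, labels, stats, centroids = cv2.connectedComponentsWithStats(processed)
    blocks = []
    for i in range(1, num):                      # label 0 is the background
        x, y, w, h, _area = stats[i]
        patch = doodle_gray[y:y + h, x:x + w]
        patch = cv2.resize(patch, (256, 256))    # fixed 1 x 256 x 256 scale
        blocks.append((tuple(centroids[i]), patch))
    return blocks
```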
S2.3, doodle recognition with the depth model. A recognition network is used to identify the doodle data. The recognition network architecture is based on the inverted residual structure: in the ordinary residual structure the main branch has three convolutions and the two point-wise convolutions have many channels, whereas the inverted residual structure is the opposite, with many channels in the middle convolution (which still uses a depthwise-separable structure) and few channels at the two ends. In addition, removing the non-linear transformation in the main branch proved effective and preserves the model's expressiveness. Using cross entropy as the loss function, a comprehensive accuracy of 81% is reached after training for 70 epochs. According to the feedback of the recognition model, the obtained labels are used to index the corresponding picture database, the nearest neighbor model retrieves the picture with the highest similarity, and the picture is finally fused into the corresponding position of the background; the final effect is shown in Fig. 8.
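The label-indexed nearest-neighbor retrieval could be sketched as follows; the recognizer and database structures are assumptions used only to illustrate the lookup, and the inverted-residual recognition network itself is not re-implemented here.

```python
import numpy as np

def retrieve_foreground(block, recognizer, database):
    """Pick the closest database picture for one doodle block.

    `recognizer` is assumed to return (label, feature_vector) for a block;
    `database` is assumed to map each label to a list of (feature_vector,
    picture) entries indexed in advance.
    """
    label, query_feat = recognizer(block)
    candidates = database[label]
    # Nearest neighbor: smallest feature distance = highest similarity.
    distances = [np.linalg.norm(query_feat - feat) for feat, _pic in candidates]
    best = int(np.argmin(distances))
    return candidates[best][1]
```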
With the above technical scheme, the adversarial generation network makes the generated background picture vivid and close to a real scene; the foreground is generated by image retrieval, and the foreground and background are finally superimposed to obtain a result that meets the user's requirements and aesthetic expectations.
References:
1. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
2. A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
3. M. Mirza and S. Osindero. Conditional generative adversarial nets. 2014.
4. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
5. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.
6. X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, Springer, Cham, 2016: 318-335.
7. T. Karras, T. Aila, S. Laine, et al. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
8. Sangkloy, Patsorn, et al. "The Sketchy Database: learning to retrieve badly drawn bunnies." ACM Transactions on Graphics (TOG) 35.4 (2016): 119.

Claims (10)

1. A photo-level image generation method based on semantic content and rapid image retrieval, characterized in that: the method comprises a background generation part and a foreground generation part, and comprises the following specific steps:
s1, the background generating part, comprising the steps of:
S1.1, acquiring a training data set for training the background picture generation model: selecting a large number of color pictures as targets I_G, labeling them to determine the scene category and obtain a semantic segmentation map I_S for each color picture, obtaining augmented picture data by mirroring and cropping, and storing the data as matched pairs to serve as the training data set for deep learning;
S1.2, inputting the color pictures of the training data set into an encoder network to perform the feature extraction stage and reconstruct the Gaussian distribution of the corresponding style;
S1.3, using the Gaussian distribution reconstructed by the encoder network as the input of a generator network, with the inserted semantic map as auxiliary input, to obtain and output enhanced semantic-map features;
S1.4, inputting the concatenation of the one-hot semantic map and the output of the generator network into the discriminator network to judge how real the generated background is, the discriminator network being expressed as:
I′_O = ReLU((ReLU(W_d1 * concat(I_s, F′) + B_d1) × W_d2 + B_d2) … × W_dn + B_dn)
wherein W_d1, W_d2, W_dn, B_d1, B_d2 and B_dn respectively represent the weight and bias parameters of the first, second and nth convolutional layers, n is the number of convolutional layers, F′ is the output of the generator, and I_s represents the output of the deconvolution stage;
s1.5, calculating and obtaining a loss function, specifically:
S1.5.1, comparing the generated image obtained by the generator with the corresponding original color image and calculating the Perceptual Loss, expressed as:
L_perc = (1 / (C_j H_j W_j)) · || φ_j(F′) − φ_j(I_G) ||_2^2
where j denotes the j-th layer of the network, C_j H_j W_j is the size of the feature map of the j-th layer, and φ represents the loss network;
S1.5.2, meanwhile using a HingeLoss-based loss function as the optimization loss of the GAN, the loss function being:
L_D = −E_(x,y)[min(0, −1 + D(x, y))] − E_(x,z)[min(0, −1 − D(x, G(x, z)))]
L_G = −E_(x,z)[D(x, G(x, z))]
wherein D represents a discriminator, G represents a generator, z is a hidden variable, x represents input, and y is a target;
s1.5.3, performing point-to-point loss calculation using MSE loss function;
S1.6, training the background picture generation model with a progressive training strategy: the training process is divided into several preset sub-training periods, which are trained in sequence with a step-growth strategy; at the start of training the original pictures are scaled down to small pictures and training begins with a large learning rate, and after each sub-training period the original color pictures are gradually enlarged and the learning rate is gradually reduced; if the color image generated after a sub-training period, compared with the corresponding original color image, does not reach the preset reconstruction effect, back-propagation continues, the convolution weight and bias parameters are updated with a gradient descent optimization algorithm, and S1.2 is executed again; when the color images generated after a sub-training period meet expectations, or all preset sub-training periods are completed, the final result is obtained;
s2, the foreground generating part comprises the following steps:
S2.1, preparing the foreground generation data set: selecting a large doodle data set as training data for training the doodle recognition model; mirroring and cropping the original doodle images to obtain a large amount of augmented picture data, and storing the doodle data in the corresponding folders according to their labels;
S2.2, splitting the doodle data into blocks with computer vision: applying erosion and dilation to the doodle pictures, counting the blocks of the corresponding objects in each picture via connected components, recording the center point of each block, and scaling each block to a fixed size;
S2.3, inputting the block pictures into the recognition network of the depth model for doodle recognition, indexing the corresponding picture database with the labels obtained from the recognition network's feedback, retrieving the picture with the highest similarity using a nearest neighbor model, and fusing it into the corresponding position of the background.
2. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the initial size of the color pictures in S1.1 is 3 × 256 × 256, corresponding to the color channels, picture width and picture height respectively; the initial size of the semantic segmentation map is 1 × 256 × 256, with a single channel storing the label information.
3. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the feature extraction stage in S1.2 includes four large convolutional layers, each large convolutional layer includes a convolutional layer, a batch regularization layer and a ReLU activation function, and the calculation formula of each large convolutional layer is as follows:
F = ReLU{BN{W_1 × I_g + B_1}}   (1)
wherein ReLU is the non-linear activation function, W_1 and B_1 respectively represent the weights and biases of the convolutional layer in the feature extraction stage, BN represents the batch normalization function, I_g represents the input picture, and F is the output obtained in the feature extraction stage.
4. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the output of the feature extraction stage in S1.2 is passed through a reshaping module and then two fully connected layers, which respectively produce the mean and variance of the Gaussian distribution.
5. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the loss function used by the encoder network of S1.2 is KL Divergence, whose formula is:
L_KLD = D_KL(q(z|x) || p(z))   (2)
wherein p and q are both standard Gaussian distributions.
6. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the generator in S1.3 comprises two or more custom regularized residual unit blocks; each regularized residual unit contains a custom regularization block; random Gaussian noise is added at the input of the custom regularization block, the inserted semantic map is one-hot encoded into a matrix whose dimension equals the number of semantic labels, resized to the same size as the input and passed through a convolution kernel and a ReLU activation function, and finally two convolution kernels respectively produce the weight and variance corresponding to the semantic map, which are combined with the noise-augmented input to give the final result.
7. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 6, wherein: the formula of the custom regular block in S1.3 is expressed as:
F_s = ReLU(Resize(I_s) × W_1 + b_1)   (2)
W_s = W_2 × F_s + b_2   (3)
b_s = W_3 × F_s + b_3   (4)
F′ = concat(I_g, Noise) × W_s + b_s   (5)
wherein W_1, b_1 represent the weight and bias of the first convolutional layer, W_2, b_2 the weight and bias of the convolution that computes the weight, W_3, b_3 the weight and bias of the convolution that computes the bias, F_s the intermediate result, and F′ the output of the custom regularization block.
8. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 6, wherein: the regularized residual unit in S1.3 comprises two convolutional layers and a jump connection, and the output of the generator network is connected with the output of the feature extraction layer through the jump connection.
9. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the doodle data in S2.1 are divided into 30 categories.
10. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the doodle recognition model in S2.3 uses an inverted residual structure.
CN201910813199.2A 2019-08-30 2019-08-30 Photo-level image generation method based on semantic content and rapid image retrieval Active CN110634170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910813199.2A CN110634170B (en) 2019-08-30 2019-08-30 Photo-level image generation method based on semantic content and rapid image retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910813199.2A CN110634170B (en) 2019-08-30 2019-08-30 Photo-level image generation method based on semantic content and rapid image retrieval

Publications (2)

Publication Number Publication Date
CN110634170A CN110634170A (en) 2019-12-31
CN110634170B true CN110634170B (en) 2022-09-13

Family

ID=68969687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910813199.2A Active CN110634170B (en) 2019-08-30 2019-08-30 Photo-level image generation method based on semantic content and rapid image retrieval

Country Status (1)

Country Link
CN (1) CN110634170B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145128B (en) * 2020-03-02 2023-05-26 Oppo广东移动通信有限公司 Color enhancement method and related device
CN111461250A (en) * 2020-04-09 2020-07-28 上海城诗信息科技有限公司 Street view model generation method, device and system and storage medium
CN111563482A (en) * 2020-06-18 2020-08-21 深圳天海宸光科技有限公司 Gas station dangerous scene picture generation method based on GAN
CN111967533B (en) * 2020-09-03 2022-09-23 中山大学 Sketch image translation method based on scene recognition
CN116250021A (en) * 2020-11-13 2023-06-09 华为技术有限公司 Training method of image generation model, new view angle image generation method and device
CN112508991B (en) * 2020-11-23 2022-05-10 电子科技大学 Panda photo cartoon method with separated foreground and background
CN112699885A (en) * 2020-12-21 2021-04-23 杭州反重力智能科技有限公司 Semantic segmentation training data augmentation method and system based on antagonism generation network GAN
CN112685590B (en) * 2020-12-29 2022-10-14 电子科技大学 Image retrieval method based on convolutional neural network regularization processing
CN115454356B (en) * 2022-10-26 2023-01-24 互联时刻(北京)信息科技有限公司 Data file processing method, device and equipment based on recognition and aggregation algorithm
CN117351520B (en) * 2023-10-31 2024-06-11 广州恒沙数字科技有限公司 Front background image mixed generation method and system based on generation network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109064423A (en) * 2018-07-23 2018-12-21 福建帝视信息科技有限公司 It is a kind of based on unsymmetrical circulation generate confrontation loss intelligence repair drawing method
CN109712203A (en) * 2018-12-29 2019-05-03 福建帝视信息科技有限公司 A kind of image rendering methods based on from attention generation confrontation network
CN110175251A (en) * 2019-05-25 2019-08-27 西安电子科技大学 The zero sample Sketch Searching method based on semantic confrontation network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018053340A1 (en) * 2016-09-15 2018-03-22 Twitter, Inc. Super resolution using a generative adversarial network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109064423A (en) * 2018-07-23 2018-12-21 福建帝视信息科技有限公司 It is a kind of based on unsymmetrical circulation generate confrontation loss intelligence repair drawing method
CN109712203A (en) * 2018-12-29 2019-05-03 福建帝视信息科技有限公司 A kind of image rendering methods based on from attention generation confrontation network
CN110175251A (en) * 2019-05-25 2019-08-27 西安电子科技大学 The zero sample Sketch Searching method based on semantic confrontation network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Review of research on generative adversarial networks and their computer vision applications; Cao Yangjie et al.; Journal of Image and Graphics (《中国图象图形学报》); 2018-10-16 (No. 10); full text *

Also Published As

Publication number Publication date
CN110634170A (en) 2019-12-31

Similar Documents

Publication Publication Date Title
CN110634170B (en) Photo-level image generation method based on semantic content and rapid image retrieval
Portenier et al. Faceshop: Deep sketch-based face image editing
CN109508669B (en) Facial expression recognition method based on generative confrontation network
Deng et al. Aesthetic-driven image enhancement by adversarial learning
AU2017101166A4 (en) A Method For Real-Time Image Style Transfer Based On Conditional Generative Adversarial Networks
CN111767979A (en) Neural network training method, image processing method, and image processing apparatus
US20150347820A1 (en) Learning Deep Face Representation
CN108171649B (en) Image stylization method for keeping focus information
CN110570377A (en) group normalization-based rapid image style migration method
CN112686817B (en) Image completion method based on uncertainty estimation
CN112686816A (en) Image completion method based on content attention mechanism and mask code prior
WO2023151529A1 (en) Facial image processing method and related device
Shi et al. Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-CNN structure for face super-resolution
Uddin et al. A perceptually inspired new blind image denoising method using L1 and perceptual loss
CN117078510A (en) Single image super-resolution reconstruction method of potential features
Gilbert et al. Disentangling structure and aesthetics for style-aware image completion
CN114820303A (en) Method, system and storage medium for reconstructing super-resolution face image from low-definition image
CN111695507B (en) Static gesture recognition method based on improved VGGNet network and PCA
Ueno et al. Continuous and gradual style changes of graphic designs with generative model
CN110569763B (en) Glasses removing method for fine-grained face recognition
CN109035318B (en) Image style conversion method
CN113538507B (en) Single-target tracking method based on full convolution network online training
CN114170460A (en) Multi-mode fusion-based artwork classification method and system
CN113112397A (en) Image style migration method based on style and content decoupling
Shahbakhsh et al. Enhancing face super-resolution via improving the edge and identity preserving network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant