CN110634170B - Photo-level image generation method based on semantic content and rapid image retrieval - Google Patents

Photo-level image generation method based on semantic content and rapid image retrieval

Info

Publication number
CN110634170B
Authority
CN
China
Prior art keywords
training
picture
network
photo
semantic
Prior art date
Legal status
Active
Application number
CN201910813199.2A
Other languages
Chinese (zh)
Other versions
CN110634170A (en)
Inventor
薛雨阳
浦佳祺
薛裕明
李�根
童同
高钦泉
Current Assignee
Fujian Imperial Vision Information Technology Co ltd
Original Assignee
Fujian Imperial Vision Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Fujian Imperial Vision Information Technology Co ltd filed Critical Fujian Imperial Vision Information Technology Co ltd
Priority to CN201910813199.2A priority Critical patent/CN110634170B/en
Publication of CN110634170A publication Critical patent/CN110634170A/en
Application granted granted Critical
Publication of CN110634170B publication Critical patent/CN110634170B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a photo-level image generation method based on semantic content and rapid image retrieval, comprising a background generation part and a foreground generation part. The background generation part uses a generative adversarial network so that the generated background picture is vivid and close to a real scene. The foreground generation part uses computer vision to split the doodle data into blocks, inputs the blocks into the recognition network of a depth model for doodle recognition, indexes the corresponding picture database with the labels obtained from the recognition network's feedback, retrieves the picture with the highest similarity using a nearest neighbor model, and fuses it into the corresponding position of the background. The foreground is thus generated by image retrieval, and the foreground and background are finally superimposed, making the generated picture more realistic and complete.

Description

Photo-level image generation method based on semantic content and rapid image retrieval
Technical Field
The invention relates to the field of image generation and image retrieval, in particular to a method for generating a live-action picture based on a semantic graph and image retrieval.
Background
Image generation based on deep learning is a relatively new artificial intelligence technology: given simple user interaction and prior knowledge of images, corresponding pictures are generated automatically according to the composition drawn by the user. With the rise of generative adversarial networks, image and video content generation has reached ordinary users; it can now be experienced not only on computers but also on mobile devices such as mobile phones and tablets. However, constrained by model complexity, the lack of training data and other factors, the generated picture is often unsatisfactory and cannot reach the quality of a real photograph. Furthermore, each user has a different understanding of beauty, and generated pictures often fall short aesthetically. Therefore, to make the quality of the generated image as high as possible and meet user expectations while improving the user experience, a better photo-level image generation method is needed.
Graphics experts and computer scientists have long studied the image generation problem. The biggest difficulty of generative models is that the machine struggles to understand the input data and, on that basis, to generate results that follow the same distribution as the input. Moreover, user input differs from the training data, sometimes lying far from its distribution, so it is hard for the trained model to generalize well.
With Goodfellow's proposal of the generative adversarial network (GAN) [1], the picture generation problem advanced significantly beyond traditional methods. A generative adversarial network consists of two neural networks, a generator and a discriminator; through adversarial learning the two improve each other day by day until the output approaches real data. In recent years a great many GAN-based methods have been proposed, including the most basic DCGAN [2], Conditional GAN [3] and Pix2Pix [4], as well as more complex models such as CycleGAN [5]. Although GAN-based research has achieved good results, these methods are basically applied to learning the mapping between the image to be converted and a reference image, so that the generated image becomes closer to a real image and more vivid.
In summary, GAN-based solutions to the image generation problem can be grouped into three categories: direct, hierarchical and iterative. The direct method is the most intuitive: a single generator and a single discriminator form the model, with no other branches. Early GAN models such as the GAN and DCGAN mentioned above basically follow this approach; their generators consist essentially of convolutional layers, batch normalization layers and ReLU activation functions. Unlike the direct method, the hierarchical method employs more than one pair of generators and discriminators to enhance the generation effect; the idea is to treat a picture as different parts, such as "texture" and "style", or "foreground" and "background". SS-GAN [6], for example, divides its network into two pairs of adversarial generation networks: Structure-GAN is responsible for generating a better body structure, and Style-GAN is responsible for generating the picture style. The iterative method is more particular: its model contains several pairs of generators and discriminators with identical or similar structures, and the generated images are refined through iteration. Progressive-GAN [7] first trains one pair of generator and discriminator at 4×4 resolution and then grows step by step to 1024×1024, until a high-definition face image is generated.
Disclosure of Invention
The invention aims to provide a photo-level image generation method based on semantic content and rapid image retrieval, which uses a generative adversarial network so that the generated background picture is vivid and close to a real scene, generates the foreground by image retrieval, and finally superimposes the foreground and the background to obtain a result that meets the user's requirements and aesthetic expectations.
The technical scheme adopted by the invention is as follows:
A photo-level image generation method based on semantic content and rapid image retrieval comprises a background generation part and a foreground generation part, and specifically comprises the following steps:
s1, the background generation part, including the steps of:
S1.1, acquiring a training data set for training the background picture generation model: selecting a large number of color pictures as targets I_G, labeling them to determine the scene category and obtain a semantic segmentation map I_S for each color picture, obtaining augmented picture data by mirroring and cropping, and storing the data as matched pairs to serve as the training data set for deep learning;
S1.2, inputting the color pictures of the training data set into an encoder network to perform the feature extraction stage and reconstruct the Gaussian distribution of the corresponding style;
S1.3, using the Gaussian distribution reconstructed by the encoder network as the input of a generator network, with the inserted semantic map as auxiliary input, to obtain and output enhanced semantic-map features;
S1.4, inputting the concatenation of the one-hot semantic map and the output of the generator network into the discriminator network to judge how real the generated background is, the discriminator network being expressed as:
I′_O = ReLU((ReLU(W_d1 * concat(I_s, F′) + B_d1) × W_d2 + B_d2) … × W_dn + B_dn)   (6)
wherein W_d1, W_d2, W_dn, B_d1, B_d2 and B_dn respectively represent the weight and bias parameters of the first, second and nth convolutional layers, n is the number of convolutional layers, F′ is the output of the generator, and I_s represents the output of the deconvolution stage;
s1.5, calculating and obtaining a loss function, specifically:
S1.5.1, comparing the generated image obtained by the generator with the corresponding original color image and calculating the Perceptual Loss, expressed as:
L_perc = (1 / (C_j H_j W_j)) · || φ_j(F′) − φ_j(I_G) ||_2^2
where j denotes the j-th layer of the network, C_j H_j W_j is the size of the feature map of the j-th layer, and φ represents the loss network;
S1.5.2, meanwhile using a HingeLoss-based loss function as the optimization loss of the GAN, the loss function being:
L_D = −E_(x,y)[min(0, −1 + D(x, y))] − E_(x,z)[min(0, −1 − D(x, G(x, z)))]
L_G = −E_(x,z)[D(x, G(x, z))]
wherein D represents the discriminator, G represents the generator, z is a hidden variable, x represents an input, and y is a target;
S1.5.3, performing point-to-point loss calculation using MSE loss function;
S1.6, training the background picture generation model with a progressive training strategy: the training process is divided into several preset sub-training periods, which are trained in sequence with a step-growth strategy; at the start of training the original pictures are scaled down to small pictures and training begins with a large learning rate, and after each sub-training period the original color pictures are gradually enlarged and the learning rate is gradually reduced;
if the color image generated after a sub-training period, compared with the corresponding original color image, does not reach the preset reconstruction effect, back-propagation continues, the convolution weight and bias parameters are updated with a gradient descent optimization algorithm, and S1.2 is executed again; when the color images generated after a sub-training period meet expectations, or all preset sub-training periods are completed, the final result is obtained;
s2, the foreground generating part comprises the following steps:
S2.1, preparing the foreground generation data set: selecting a large doodle data set as training data for training the doodle recognition model;
mirroring and cropping the original doodle images to obtain a large amount of augmented picture data, and storing the doodle data in the corresponding folders according to their labels;
S2.2, splitting the doodle data into blocks with computer vision: applying erosion and dilation to the doodle pictures, counting the blocks of the corresponding objects in each picture via connected components, recording the center point of each block, and scaling each block to a fixed size;
S2.3, inputting the block pictures into the recognition network of the depth model for doodle recognition, indexing the corresponding picture database with the labels obtained from the recognition network's feedback, retrieving the picture with the highest similarity using a nearest neighbor model, and fusing it into the corresponding position of the background.
Further, the initial size of the color pictures in S1.1 is 3 × 256 × 256, corresponding to the color channels, picture width and picture height respectively; the initial size of the semantic segmentation map is 1 × 256 × 256, with a single channel storing the label information.
Further, the feature extraction stage in S1.2 includes four large convolutional layers, each large convolutional layer includes a convolutional layer, a batch regularization layer, and a ReLU activation function, and a calculation formula of each large convolutional layer is:
F = ReLU{BN{W_1 × I_g + B_1}}   (1)
wherein ReLU is the non-linear activation function, W_1 and B_1 respectively represent the weight and bias of the convolutional layer in the feature extraction stage, BN represents the batch normalization function, I_g represents the input picture, and F is the output obtained in the feature extraction stage.
Further, the output of the feature extraction stage in S1.2 is passed through a reshaping module and then two fully connected layers, which respectively produce the mean and variance of the Gaussian distribution.
Further, the loss function used by the encoder network of S1.2 is KL Divergence, which is expressed by the following formula:
L_KLD = D_KL(q(z|x) || p(z))   (2)
wherein p and q are both standard Gaussian distributions.
Further, the generator in S1.3 comprises two or more custom regularized residual unit blocks; each regularized residual unit contains a custom regularization block. Random Gaussian noise is added at the input of the custom regularization block; the auxiliary inserted semantic map is one-hot encoded into a matrix whose dimension equals the number of semantic labels, resized to the same size as the input, and passed through a convolution kernel and a ReLU activation function; finally, two convolution kernels respectively produce the weight and variance corresponding to the semantic map, which are combined with the noise-augmented input to give the final result.
Further, the formula of the custom regular block in S1.3 is expressed as:
F_s = ReLU(Resize(I_s) × W_1 + b_1)   (2)
W_s = W_2 × F_s + b_2   (3)
b_s = W_3 × F_s + b_3   (4)
F′ = concat(I_g, Noise) × W_s + b_s   (5)
wherein W_1, b_1 represent the weight and bias of the first convolutional layer, W_2, b_2 the weight and bias of the convolution that computes the weight, W_3, b_3 the weight and bias of the convolution that computes the bias, F_s the intermediate result, and F′ the output of the custom regularization block.
Further, the regularized residual unit in S1.3 comprises two convolutional layers and a jump connection; the output of the generator network is connected with the output of the feature extraction layer through the jump connection.
further, the scribble data in S2.1 is divided into 30 categories.
Further, the doodle recognition model in S2.3 uses an inverted residual structure.
With the above technical scheme, the adversarial generation network makes the generated background picture vivid and close to a real scene; the foreground is generated by image retrieval, and the foreground and background are finally superimposed to obtain a result that meets the user's requirements and aesthetic expectations.
Drawings
The invention is described in further detail below with reference to the drawings and the specific embodiments.
FIG. 1 is a schematic diagram of an encoder network configuration of the present invention;
FIG. 2 is a schematic diagram of a network structure of a custom regular block of the present invention;
FIG. 3 is a schematic diagram of a network structure of a regularized residual unit of the present invention;
FIG. 4 is a schematic diagram of a generator network architecture of the present invention;
FIG. 5 is a schematic diagram of a discriminator network according to the present invention;
FIG. 6 is a general architectural diagram of the background generation portion of the present invention;
FIG. 7 is a comparison of the background generation portion of the present invention;
FIG. 8 is a diagram of the effect of the background fusion foreground of the present invention.
Detailed Description
As shown in Figs. 1 to 8, the invention discloses a photo-level image generation method based on semantic content and rapid image retrieval, which includes a background generation part and a foreground generation part; the specific steps are as follows:
S1, the background generation part, whose overall architecture is shown in Fig. 6, comprises the following steps:
S1.1, selecting a large number of color pictures as targets I_G for training the picture generation model, and labeling them to determine the scene category, which yields a semantic segmentation map I_S for each color picture. A large amount of augmented picture data is obtained by mirroring and cropping the original color pictures and their semantic segmentation maps, and stored as matched data pairs to serve as the training data set for deep learning. The initial size of each color picture is 3 × 256 × 256, corresponding to the color channels, picture width and picture height; the semantic picture has an initial size of 1 × 256 × 256, with a single channel storing the label information.
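For illustration, a minimal Python sketch of the paired augmentation described above is given below; it assumes PIL and NumPy are available, and the crop size and file-path arguments are illustrative choices, not taken from the patent.

```python
import random
import numpy as np
from PIL import Image

def augment_pair(color_path, semantic_path, crop=224):
    """Produce one mirrored/cropped (color, semantic) training pair.

    The same flip and crop window are applied to both images so that the
    semantic labels stay aligned with the RGB pixels.
    """
    color = Image.open(color_path).convert("RGB").resize((256, 256))
    semantic = Image.open(semantic_path).resize((256, 256), Image.NEAREST)

    # Random horizontal mirror, applied identically to both images.
    if random.random() < 0.5:
        color = color.transpose(Image.FLIP_LEFT_RIGHT)
        semantic = semantic.transpose(Image.FLIP_LEFT_RIGHT)

    # Random crop with a shared window, then scale back to 256 x 256.
    left = random.randint(0, 256 - crop)
    top = random.randint(0, 256 - crop)
    box = (left, top, left + crop, top + crop)
    color, semantic = color.crop(box), semantic.crop(box)

    # Color target I_G: 3 x 256 x 256 floats; semantic map I_S: 1 x 256 x 256 labels.
    color = np.asarray(color.resize((256, 256)), np.float32).transpose(2, 0, 1) / 255.0
    semantic = np.asarray(semantic.resize((256, 256), Image.NEAREST), np.int64)[None]
    return color, semantic
```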
S1.2, encoder section. And inputting the color images in the training data set into an encoder network to execute a characteristic extraction stage, and finally reconstructing the corresponding style distribution. The details of the steps are as follows:
the feature extraction stage consists of four large convolution layers, including a convolution layer of 3 × 3 convolution kernels, a batch regularization layer and a ReLU activation function, wherein the calculation formula of one large convolution block is as follows:
F = ReLU{BN{W_1 × I_g + B_1}}   (1)
wherein ReLU is the non-linear activation function, W_1 and B_1 respectively represent the weights and biases of the convolutional layer in the feature extraction stage, BN represents the batch normalization function, I_g is the input picture, and F is the output of the feature extraction stage. After the four large convolution blocks, the features pass through a reshaping module (Reshape) and then two fully connected layers, which respectively produce the mean and variance of a Gaussian distribution; the mean and variance here are taken to represent the Gaussian distribution of the style. The encoder part is shown in Fig. 1. The loss function used by the encoder part is the KL Divergence:
L_KLD = D_KL(q(z|x) || p(z))   (2)
wherein p and q are both standard Gaussian distributions.
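A minimal PyTorch sketch of such an encoder and its KL loss follows; the channel widths, latent dimension and stride choices are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Sketch of the encoder of Fig. 1: four large conv blocks (conv + BN + ReLU,
    formula (1)), a reshape, then two fully connected layers giving the mean and
    (log-)variance of the style Gaussian."""

    def __init__(self, z_dim=256):
        super().__init__()
        chans = [3, 64, 128, 256, 512]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                       nn.BatchNorm2d(cout),
                       nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*blocks)           # 256 -> 16 spatially
        self.fc_mu = nn.Linear(512 * 16 * 16, z_dim)
        self.fc_logvar = nn.Linear(512 * 16 * 16, z_dim)

    def forward(self, img):
        f = self.features(img).flatten(1)                # reshaping module
        return self.fc_mu(f), self.fc_logvar(f)

def kld_loss(mu, logvar):
    """KL divergence (2) between q(z|x) = N(mu, sigma^2) and a standard normal."""
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
```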
S1.3, generator part. The generator part is the most important part of the whole generation network, and aims to take the random distribution obtained by the encoder as input so as to continuously assist the inserted semantic graph to enhance the semantic graph characteristics. The generator mainly comprises a self-defined regularization residual error unit block; each regularization residual unit also comprises a self-defined regularization block. In each custom regular block, random gaussian noise is added to the input of the regular block in order to generate richer textures. Meanwhile, the auxiliary inserted semantic graph is subjected to one-hot operation to obtain a matrix with the same dimensionality as the semantic label quantity, Resize is conducted to the matrix with the same size as the input, then a convolution kernel of 3 x 3 and a ReLU activation function are conducted, finally, the weight and the variance corresponding to the semantic graph are obtained through two convolution kernels respectively, and the final result is obtained after operation is conducted on the semantic graph and the inserted noise input. The custom regular block is shown in fig. 2, and its formula is:
F_s = ReLU(Resize(I_s) × W_1 + b_1)   (2)
W_s = W_2 × F_s + b_2   (3)
b_s = W_3 × F_s + b_3   (4)
F′ = concat(I_g, Noise) × W_s + b_s   (5)
wherein W_1, b_1 represent the weight and bias of the first convolutional layer, W_2, b_2 the weight and bias of the convolution that computes the weight, W_3, b_3 the weight and bias of the convolution that computes the bias, F_s the intermediate result, and F′ the output of the custom regularization block. The custom regularization block is similar to batch normalization, but whereas batch normalization operates on channels, which is not enough to reconstruct the whole picture, the custom regularization block performs the reconstruction at the pixel level, so the generated image can be more accurate and fine.
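A possible PyTorch sketch of such a custom regularization block, in the spirit of equations (2)-(5), is given below; channel widths are assumptions, and the noise is added rather than concatenated as a simplification of equation (5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticRegBlock(nn.Module):
    """Sketch of the custom regularization block of Fig. 2: the one-hot semantic
    map is resized to the feature resolution, passed through a shared 3x3 conv +
    ReLU (F_s), and two further convs predict a per-pixel weight W_s and bias b_s
    that modulate the noisy input features (F')."""

    def __init__(self, feat_channels, num_labels, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(num_labels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_weight = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_bias = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, semantic_onehot):
        # Random Gaussian noise at the block input for richer texture
        # (equation (5) concatenates the noise; addition is used here for brevity).
        feat = feat + torch.randn_like(feat)
        seg = F.interpolate(semantic_onehot, size=feat.shape[2:], mode="nearest")
        shared = self.shared(seg)          # F_s, equation (2)
        w = self.to_weight(shared)         # W_s, equation (3)
        b = self.to_bias(shared)           # b_s, equation (4)
        return feat * w + b                # F',  equation (5)
```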
The regularized residual unit, shown in Fig. 3, contains several custom regularization blocks. It is composed of two convolutional layers and a jump connection; the output of this layer is connected to the output of the feature extraction layer through the jump connection, which avoids gradient vanishing and strengthens the information by preserving the original features. Unlike an ordinary residual unit, the batch normalization layer is replaced by the custom regularization block. The complete background generator network is shown in Fig. 4.
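Continuing the previous sketch, a regularized residual unit could be assembled as follows; the layer configuration is again an illustrative assumption rather than the patent's exact design.

```python
import torch.nn as nn
import torch.nn.functional as F
# Reuses SemanticRegBlock from the sketch above.

class RegResidualUnit(nn.Module):
    """Sketch of the regularized residual unit of Fig. 3: two 3x3 convolutions,
    each preceded by the custom regularization block instead of batch norm,
    plus a jump (skip) connection that preserves the incoming features."""

    def __init__(self, channels, num_labels):
        super().__init__()
        self.reg1 = SemanticRegBlock(channels, num_labels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.reg2 = SemanticRegBlock(channels, num_labels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, semantic_onehot):
        out = self.conv1(F.relu(self.reg1(feat, semantic_onehot)))
        out = self.conv2(F.relu(self.reg2(out, semantic_onehot)))
        return out + feat    # jump connection
```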
S1.4, a discriminator section. As shown in fig. 5, in order to determine the authenticity of background generation, a discriminator model of Pix2PixHD is used for reference, and an improvement is made on the basis of the discriminator model. The input of the network is a semantic graph in a one-hot form and a generator generates a jointed matrix. The network is mainly based on a volume sum and a ReLU activation function, and the formula is as follows:
I′_O = ReLU((ReLU(W_d1 * concat(I_s, F′) + B_d1) × W_d2 + B_d2) … × W_dn + B_dn)   (6)
wherein W_d1, W_d2, W_dn, B_d1, B_d2 and B_dn respectively represent the weight and bias parameters of the first, second and nth convolutional layers, F′ is the output of the generator, and I_s represents the output of the deconvolution stage. Here, n is set to 4.
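A minimal PyTorch sketch of such a discriminator with n = 4 convolution + ReLU stages follows; channel widths, strides and the final scoring convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BackgroundDiscriminator(nn.Module):
    """Sketch of the discriminator of Fig. 5 / formula (6): the one-hot semantic
    map and the generator output are concatenated and passed through n = 4
    convolution + ReLU stages, ending in a patch-level real/fake score."""

    def __init__(self, num_labels, base=64, n_layers=4):
        super().__init__()
        layers, cin = [], num_labels + 3           # semantic channels + RGB
        for i in range(n_layers):
            cout = base * (2 ** i)
            layers += [nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            cin = cout
        layers += [nn.Conv2d(cin, 1, 3, padding=1)]  # per-patch realism score
        self.net = nn.Sequential(*layers)

    def forward(self, semantic_onehot, generated):
        return self.net(torch.cat([semantic_onehot, generated], dim=1))
```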
S1.5, calculating a loss function. And (3) comparing the generated image obtained by the generator in the step (3) with the original color image corresponding to the step (1) and calculating a Perceptual Loss penalty function. The loss function can be expressed as:
L_perc = (1 / (C_j H_j W_j)) · || φ_j(F′) − φ_j(I_G) ||_2^2
where j denotes the j-th layer of the network and C_j H_j W_j is the size of the feature map of the j-th layer. The loss network is a VGG16 network trained on ImageNet, denoted φ.
Meanwhile, a HingeLoss-based loss function is used as the optimized loss of the GAN, and the loss function is as follows:
L_D = −E_(x,y)[min(0, −1 + D(x, y))] − E_(x,z)[min(0, −1 − D(x, G(x, z)))]
L_G = −E_(x,z)[D(x, G(x, z))]
furthermore, a simpler MSE loss function is used to perform point-to-point loss calculation.
S1.6, training the picture generation model with a progressive training strategy. The training process is divided into several preset sub-training periods, which are trained in sequence with a step-growth strategy; at the start of training the original pictures are scaled down to small pictures and training begins with a large learning rate, and after each sub-training period the original color pictures are gradually enlarged and the learning rate is gradually reduced.
If the color image generated after a sub-training period, compared with the corresponding original color image, does not reach the preset reconstruction effect, back-propagation continues, the convolution weight and bias parameters are updated with a gradient descent optimization algorithm, and S1.2 is executed again; when the color images generated after a sub-training period meet expectations, or all preset sub-training periods are completed, the final result is obtained. The idea is to begin training on the original pictures scaled down to small pictures, assisted by a large learning rate; after each training period the input pictures are enlarged and the learning rate is reduced. In this way the precision of higher-resolution pictures is built on top of the low-resolution pictures, reducing the distortion and unreasonable color effects produced by the convolutional generation network.
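An illustrative sketch of this progressive schedule is given below; the resolutions, learning rates, and the names optimizer, train_loader and train_step are assumptions standing in for the surrounding training code, not values from the patent.

```python
import torch.nn.functional as F

# Illustrative progressive schedule: each sub-training period enlarges the
# pictures and lowers the learning rate (values are assumptions).
schedule = [(64, 2e-4), (128, 1e-4), (192, 5e-5), (256, 2e-5)]

# `optimizer`, `train_loader` and `train_step` are assumed to be defined by the
# surrounding training code (model, losses and data pipeline).
for resolution, lr in schedule:                        # one sub-training period
    for group in optimizer.param_groups:
        group["lr"] = lr                               # step down the learning rate
    for color, semantic in train_loader:
        small_color = F.interpolate(color, size=(resolution, resolution),
                                    mode="bilinear", align_corners=False)
        small_semantic = F.interpolate(semantic.float(),
                                       size=(resolution, resolution),
                                       mode="nearest")
        loss = train_step(small_color, small_semantic)  # forward pass + losses
        optimizer.zero_grad()
        loss.backward()                                 # back-propagation
        optimizer.step()                                # gradient-descent update
```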
As shown in Fig. 7, the background generated by the method of the present invention is compared with backgrounds generated by other prior-art methods.
S2, the foreground generating part comprises the following steps:
S2.1, preparing the foreground generation data set. To train the doodle recognition model, a large doodle data set is selected as training data; the publicly available Sketchy Dataset [8] is adopted. A large amount of augmented picture data is obtained by mirroring and cropping the original doodle images, and the doodle data are stored into the corresponding folders according to their labels. The initial size of an original doodle picture is 1 × 256 × 256, corresponding to the color channel, picture width and picture height. The doodle data are divided into 30 categories in the subsequent classification.
S2.2, blocking the doodle data with computer vision. The input image is recognized with computer-vision methods: since one input image may contain several objects, it must be split into blocks before being fed to the network. The blocking operation applies erosion and dilation to the picture, with an erosion iteration count of 5 and a dilation iteration count of 3. The number of blocks in the picture is then counted via connected components, the center point of each block is recorded, and each block is scaled to 1 × 256 × 256. Finally, the processed pictures are input into the depth model for recognition.
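A minimal OpenCV sketch of this blocking step follows; the binarization threshold and kernel size are assumptions, while the erosion/dilation iteration counts follow the values stated above.

```python
import cv2
import numpy as np

def split_doodle_into_blocks(doodle_gray):
    """Erode, dilate, find connected components, record each block's center and
    rescale each block to a fixed 256 x 256 patch."""
    kernel = np.ones((3, 3), np.uint8)
    # Binarize the doodle so that strokes become foreground (assumed threshold).
    _, binary = cv2.threshold(doodle_gray, 127, 255, cv2.THRESH_BINARY_INV)
    processed = cv2.erode(binary, kernel, iterations=5)
    processed = cv2.dilate(processed, kernel, iterations=3)

    num, labels, stats, centroids = cv2.connectedComponentsWithStats(processed)
    blocks = []
    for i in range(1, num):                      # label 0 is the background
        x, y, w, h, _area = stats[i]
        patch = doodle_gray[y:y + h, x:x + w]
        patch = cv2.resize(patch, (256, 256))    # fixed 1 x 256 x 256 scale
        blocks.append((tuple(centroids[i]), patch))
    return blocks
```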
S2.3, doodle recognition with the depth model. A recognition network is used to identify the doodle data. The recognition network architecture is based on the inverted residual structure: in the ordinary residual structure the main branch has three convolutions and the two point-wise convolutions have many channels, whereas the inverted residual structure is the opposite, with many channels in the middle convolution (which still uses a depthwise-separable structure) and few channels at the two ends. In addition, removing the non-linear transformation in the main branch proved effective and preserves the model's expressiveness. Using cross entropy as the loss function, a comprehensive accuracy of 81% is reached after training for 70 epochs. According to the feedback of the recognition model, the obtained labels are used to index the corresponding picture database, the nearest neighbor model retrieves the picture with the highest similarity, and the picture is finally fused into the corresponding position of the background; the final effect is shown in Fig. 8.
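The label-indexed nearest-neighbor retrieval could be sketched as follows; the recognizer and database structures are assumptions used only to illustrate the lookup, and the inverted-residual recognition network itself is not re-implemented here.

```python
import numpy as np

def retrieve_foreground(block, recognizer, database):
    """Pick the closest database picture for one doodle block.

    `recognizer` is assumed to return (label, feature_vector) for a block;
    `database` is assumed to map each label to a list of (feature_vector,
    picture) entries indexed in advance.
    """
    label, query_feat = recognizer(block)
    candidates = database[label]
    # Nearest neighbor: smallest feature distance = highest similarity.
    distances = [np.linalg.norm(query_feat - feat) for feat, _pic in candidates]
    best = int(np.argmin(distances))
    return candidates[best][1]
```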
With the above technical scheme, the adversarial generation network makes the generated background picture vivid and close to a real scene; the foreground is generated by image retrieval, and the foreground and background are finally superimposed to obtain a result that meets the user's requirements and aesthetic expectations.
References:
1. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
2. A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
3. M. Mirza and S. Osindero. Conditional generative adversarial nets. 2014.
4. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
5. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.
6. X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, Springer, Cham, 2016: 318-335.
7. T. Karras, T. Aila, S. Laine, et al. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
8. Sangkloy, Patsorn, et al. "The Sketchy Database: learning to retrieve badly drawn bunnies." ACM Transactions on Graphics (TOG) 35.4 (2016): 119.

Claims (10)

1. A photo-level image generation method based on semantic content and rapid image retrieval, characterized in that: the method comprises a background generation part and a foreground generation part, and comprises the following specific steps:
s1, the background generating part, comprising the steps of:
S1.1, acquiring a training data set for training the background picture generation model: selecting a large number of color pictures as targets I_G, labeling them to determine the scene category and obtain a semantic segmentation map I_S for each color picture, obtaining augmented picture data by mirroring and cropping, and storing the data as matched pairs to serve as the training data set for deep learning;
S1.2, inputting the color pictures of the training data set into an encoder network to perform the feature extraction stage and reconstruct the Gaussian distribution of the corresponding style;
S1.3, using the Gaussian distribution reconstructed by the encoder network as the input of a generator network, with the inserted semantic map as auxiliary input, to obtain and output enhanced semantic-map features;
S1.4, inputting the concatenation of the one-hot semantic map and the output of the generator network into the discriminator network to judge how real the generated background is, the discriminator network being expressed as:
I′_O = ReLU((ReLU(W_d1 * concat(I_s, F′) + B_d1) × W_d2 + B_d2) … × W_dn + B_dn)
wherein W_d1, W_d2, W_dn, B_d1, B_d2 and B_dn respectively represent the weight and bias parameters of the first, second and nth convolutional layers, n is the number of convolutional layers, F′ is the output of the generator, and I_s represents the output of the deconvolution stage;
s1.5, calculating and obtaining a loss function, specifically:
S1.5.1, comparing the generated image obtained by the generator with the corresponding original color image and calculating the Perceptual Loss, expressed as:
L_perc = (1 / (C_j H_j W_j)) · || φ_j(F′) − φ_j(I_G) ||_2^2
where j denotes the j-th layer of the network, C_j H_j W_j is the size of the feature map of the j-th layer, and φ represents the loss network;
S1.5.2, meanwhile using a HingeLoss-based loss function as the optimization loss of the GAN, the loss function being:
L_D = −E_(x,y)[min(0, −1 + D(x, y))] − E_(x,z)[min(0, −1 − D(x, G(x, z)))]
L_G = −E_(x,z)[D(x, G(x, z))]
wherein D represents a discriminator, G represents a generator, z is a hidden variable, x represents input, and y is a target;
s1.5.3, performing point-to-point loss calculation using MSE loss function;
S1.6, training the background picture generation model with a progressive training strategy: the training process is divided into several preset sub-training periods, which are trained in sequence with a step-growth strategy; at the start of training the original pictures are scaled down to small pictures and training begins with a large learning rate, and after each sub-training period the original color pictures are gradually enlarged and the learning rate is gradually reduced; if the color image generated after a sub-training period, compared with the corresponding original color image, does not reach the preset reconstruction effect, back-propagation continues, the convolution weight and bias parameters are updated with a gradient descent optimization algorithm, and S1.2 is executed again; when the color images generated after a sub-training period meet expectations, or all preset sub-training periods are completed, the final result is obtained;
s2, the foreground generating part comprises the following steps:
S2.1, preparing the foreground generation data set: selecting a large doodle data set as training data for training the doodle recognition model; mirroring and cropping the original doodle images to obtain a large amount of augmented picture data, and storing the doodle data in the corresponding folders according to their labels;
S2.2, splitting the doodle data into blocks with computer vision: applying erosion and dilation to the doodle pictures, counting the blocks of the corresponding objects in each picture via connected components, recording the center point of each block, and scaling each block to a fixed size;
S2.3, inputting the block pictures into the recognition network of the depth model for doodle recognition, indexing the corresponding picture database with the labels obtained from the recognition network's feedback, retrieving the picture with the highest similarity using a nearest neighbor model, and fusing it into the corresponding position of the background.
2. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the initial size of the color pictures in S1.1 is 3 × 256 × 256, corresponding to the color channels, picture width and picture height respectively; the initial size of the semantic segmentation map is 1 × 256 × 256, with a single channel storing the label information.
3. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the feature extraction stage in S1.2 includes four large convolutional layers, each large convolutional layer includes a convolutional layer, a batch regularization layer and a ReLU activation function, and the calculation formula of each large convolutional layer is as follows:
F = ReLU{BN{W_1 × I_g + B_1}}   (1)
wherein ReLU is the non-linear activation function, W_1 and B_1 respectively represent the weights and biases of the convolutional layer in the feature extraction stage, BN represents the batch normalization function, I_g represents the input picture, and F is the output obtained in the feature extraction stage.
4. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the output of the feature extraction stage in S1.2 is passed through a reshaping module and then two fully connected layers, which respectively produce the mean and variance of the Gaussian distribution.
5. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the loss function used by the encoder network of S1.2 is KL Divergence, whose formula is:
L_KLD = D_KL(q(z|x) || p(z))   (2)
wherein p and q are both standard Gaussian distributions.
6. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the generator in S1.3 comprises two or more custom regularized residual unit blocks; each regularized residual unit contains a custom regularization block; random Gaussian noise is added at the input of the custom regularization block, the inserted semantic map is one-hot encoded into a matrix whose dimension equals the number of semantic labels, resized to the same size as the input and passed through a convolution kernel and a ReLU activation function, and finally two convolution kernels respectively produce the weight and variance corresponding to the semantic map, which are combined with the noise-augmented input to give the final result.
7. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 6, wherein: the formula of the custom regular block in S1.3 is expressed as:
F_s = ReLU(Resize(I_s) × W_1 + b_1)   (2)
W_s = W_2 × F_s + b_2   (3)
b_s = W_3 × F_s + b_3   (4)
F′ = concat(I_g, Noise) × W_s + b_s   (5)
wherein W_1, b_1 represent the weight and bias of the first convolutional layer, W_2, b_2 the weight and bias of the convolution that computes the weight, W_3, b_3 the weight and bias of the convolution that computes the bias, F_s the intermediate result, and F′ the output of the custom regularization block.
8. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 6, wherein: the regularized residual unit in S1.3 comprises two convolutional layers and a jump connection, and the output of the generator network is connected with the output of the feature extraction layer through the jump connection.
9. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the doodle data in S2.1 are divided into 30 categories.
10. The photo-level image generation method based on semantic content and fast image retrieval as claimed in claim 1, wherein: the doodle recognition model in S2.3 uses an inverted residual structure.
CN201910813199.2A 2019-08-30 2019-08-30 Photo-level image generation method based on semantic content and rapid image retrieval Active CN110634170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910813199.2A CN110634170B (en) 2019-08-30 2019-08-30 Photo-level image generation method based on semantic content and rapid image retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910813199.2A CN110634170B (en) 2019-08-30 2019-08-30 Photo-level image generation method based on semantic content and rapid image retrieval

Publications (2)

Publication Number Publication Date
CN110634170A CN110634170A (en) 2019-12-31
CN110634170B true CN110634170B (en) 2022-09-13

Family

ID=68969687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910813199.2A Active CN110634170B (en) 2019-08-30 2019-08-30 Photo-level image generation method based on semantic content and rapid image retrieval

Country Status (1)

Country Link
CN (1) CN110634170B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111145128B (en) * 2020-03-02 2023-05-26 Oppo广东移动通信有限公司 Color enhancement method and related device
CN111461250A (en) * 2020-04-09 2020-07-28 上海城诗信息科技有限公司 Street view model generation method, device and system and storage medium
CN111563482A (en) * 2020-06-18 2020-08-21 深圳天海宸光科技有限公司 Gas station dangerous scene picture generation method based on GAN
CN111967533B (en) * 2020-09-03 2022-09-23 中山大学 Sketch image translation method based on scene recognition
CN116250021A (en) * 2020-11-13 2023-06-09 华为技术有限公司 Training method of image generation model, new view angle image generation method and device
CN112508991B (en) * 2020-11-23 2022-05-10 电子科技大学 Panda photo cartoon method with separated foreground and background
CN112699885A (en) * 2020-12-21 2021-04-23 杭州反重力智能科技有限公司 Semantic segmentation training data augmentation method and system based on antagonism generation network GAN
CN112685590B (en) * 2020-12-29 2022-10-14 电子科技大学 Image retrieval method based on convolutional neural network regularization processing
CN115454356B (en) * 2022-10-26 2023-01-24 互联时刻(北京)信息科技有限公司 Data file processing method, device and equipment based on recognition and aggregation algorithm
CN117351520B (en) * 2023-10-31 2024-06-11 广州恒沙数字科技有限公司 Front background image mixed generation method and system based on generation network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109064423A (en) * 2018-07-23 2018-12-21 福建帝视信息科技有限公司 It is a kind of based on unsymmetrical circulation generate confrontation loss intelligence repair drawing method
CN109712203A (en) * 2018-12-29 2019-05-03 福建帝视信息科技有限公司 A kind of image rendering methods based on from attention generation confrontation network
CN110175251A (en) * 2019-05-25 2019-08-27 西安电子科技大学 The zero sample Sketch Searching method based on semantic confrontation network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018053340A1 (en) * 2016-09-15 2018-03-22 Twitter, Inc. Super resolution using a generative adversarial network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109064423A (en) * 2018-07-23 2018-12-21 福建帝视信息科技有限公司 It is a kind of based on unsymmetrical circulation generate confrontation loss intelligence repair drawing method
CN109712203A (en) * 2018-12-29 2019-05-03 福建帝视信息科技有限公司 A kind of image rendering methods based on from attention generation confrontation network
CN110175251A (en) * 2019-05-25 2019-08-27 西安电子科技大学 The zero sample Sketch Searching method based on semantic confrontation network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Review of research on generative adversarial networks and their computer vision applications; Cao Yangjie et al.; Journal of Image and Graphics (《中国图象图形学报》); 2018-10-16 (No. 10); full text *

Also Published As

Publication number Publication date
CN110634170A (en) 2019-12-31

Similar Documents

Publication Publication Date Title
CN110634170B (en) Photo-level image generation method based on semantic content and rapid image retrieval
Portenier et al. Faceshop: Deep sketch-based face image editing
CN109508669B (en) Facial expression recognition method based on generative confrontation network
Deng et al. Aesthetic-driven image enhancement by adversarial learning
AU2017101166A4 (en) A Method For Real-Time Image Style Transfer Based On Conditional Generative Adversarial Networks
CN111767979A (en) Neural network training method, image processing method, and image processing apparatus
US20150347820A1 (en) Learning Deep Face Representation
CN108171649B (en) Image stylization method for keeping focus information
CN110570377A (en) group normalization-based rapid image style migration method
CN112686817B (en) Image completion method based on uncertainty estimation
CN112686816A (en) Image completion method based on content attention mechanism and mask code prior
WO2023151529A1 (en) Facial image processing method and related device
Shi et al. Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-CNN structure for face super-resolution
Uddin et al. A perceptually inspired new blind image denoising method using L1 and perceptual loss
CN117078510A (en) Single image super-resolution reconstruction method of potential features
Gilbert et al. Disentangling structure and aesthetics for style-aware image completion
CN114820303A (en) Method, system and storage medium for reconstructing super-resolution face image from low-definition image
CN111695507B (en) Static gesture recognition method based on improved VGGNet network and PCA
Ueno et al. Continuous and gradual style changes of graphic designs with generative model
CN110569763B (en) Glasses removing method for fine-grained face recognition
CN109035318B (en) Image style conversion method
CN113538507B (en) Single-target tracking method based on full convolution network online training
CN114170460A (en) Multi-mode fusion-based artwork classification method and system
CN113112397A (en) Image style migration method based on style and content decoupling
Shahbakhsh et al. Enhancing face super-resolution via improving the edge and identity preserving network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant