Background
To record daily life and satisfy rising aesthetic expectations, the general public widely captures images with everyday acquisition devices. In photography, the bokeh effect is considered one of the most important aesthetic criteria. With current technology, a single-lens reflex camera fitted with a large-aperture lens can easily render an image with a natural bokeh effect, but the average user records images with a mobile phone, and a smartphone can hardly be equipped with a large-aperture lens or other special sensors, so existing smartphones have difficulty capturing photos with a bokeh effect.
In the development of synthetic bokeh-effect rendering, semantic segmentation is generally used to segment people from an image and then blur the remaining regions, so attention is restricted to portrait photos; this limitation is severe, and photos with richer scenes cannot be processed. Although synthetic bokeh rendering on smartphones has been realized by relying on special or expensive hardware, such methods are not suitable for the low-end smartphone market.
Disclosure of Invention
The purpose of the invention is as follows: one object is to provide a method for building a bokeh-effect rendering model based on a generative adversarial network (GAN), so as to solve the above problems in the prior art. A further object is to propose a system implementing the method.
The technical scheme is as follows: a generative adversarial network model for training bokeh rendering comprises a discriminator and a generator. The discriminator receives pictures, takes part in the neural-network training, and supervises the difference between generated image blocks and the bokeh image blocks in the data set; the generator is an end-to-end convolutional neural network that outputs the bokeh-rendered picture.
In a further embodiment, the discriminator is further configured as follows:
the multi-receptive-field discriminator receives the picture data set with rich scenes and takes part in training the neural network; at the same time it supervises the difference between each generated image block and the corresponding bokeh image block in the data set, attending to the details of image blocks at the same position but of different sizes.
The picture set contains rich scenes, and the pictures appear in pairs: each picture without a blurring effect corresponds to one picture with a bokeh effect. The pictures without the bokeh effect form the training set, and the pictures with the bokeh effect form the label set for supervised learning.
The generator is further a two-stage network, and both stages adopt a structure consisting of an encoder and a decoder. In the first stage, the training network learns the mapping from the image without a blurring effect to the residual between the input image and the corresponding image with a bokeh effect; in the second stage, refinement produces a realistic bokeh effect.
In the first stage, the number of base channels of the network is 16 and the maximum number of channels is 128. The relation between the input image I, the corresponding output image O with a bokeh effect in the label set, and the residual R is R = I - O, so I - R represents a coarse picture with a bokeh rendering effect. The residual R produced by the network at this stage already carries certain depth information, i.e. no additional depth information is needed as prior knowledge.
In the second stage, the number of base channels of the network is 32 and the maximum number of channels is 256; the coarse bokeh-rendered picture generated in the first stage is refined into a realistic bokeh-effect picture.
In the generator structure, the encoder block is formed of convolutional layers with a preset stride, comprises three downsampling layers, and outputs a feature map.
In the generator structure, the decoder block is formed of three transposed convolutional layers with a preset stride, and receives the feature map transformed by the residual blocks. That feature map is obtained by passing the encoder block's output through a preset number of residual blocks; each residual block connects conv, ReLU, instance-norm, conv, and ReLU layers in sequence, with an additive connection between its input and output.
The convolutional layers of the encoder block and the transposed convolutional layers of the decoder block are all activated by ReLU. The output layer of each stage is realized by a convolution with a preset stride followed by a tanh function, and skip connections between each convolutional layer and its mirrored transposed convolutional layer enhance the details of the output image.
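A minimal NumPy sketch of the residual block just described (conv, ReLU, instance norm, conv, ReLU, with the additive input-output connection); the 1x1 convolutions here stand in for the actual convolutional layers, so this illustrates only the structure, not the invention's exact implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def instance_norm(x, eps=1e-5):
    # Normalize each channel of a single sample over its spatial dimensions.
    mu = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv1x1(x, w):
    # 1x1 convolution = per-pixel channel mixing (stand-in for the real convs).
    return x @ w

def residual_block(x, w1, w2):
    # conv -> ReLU -> instance norm -> conv -> ReLU, plus the additive
    # connection between input and output described in the text.
    y = relu(conv1x1(x, w1))
    y = instance_norm(y)
    y = relu(conv1x1(y, w2))
    return x + y
```

In this simplified form the final ReLU makes the learned correction non-negative; in the real network the convolutions are spatial and the block is repeated a preset number of times between encoder and decoder.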
A method for building a bokeh-effect rendering model based on a generative adversarial network comprises the following steps:
step one, obtaining pictures for training;
step two, feeding the obtained pictures into a neural network for training;
and step three, obtaining the trained generative adversarial network model for bokeh rendering.
In a further embodiment, step one is further: for supervised learning, each training picture corresponds one-to-one to a label picture in the constructed training set, and the label picture is a picture with a bokeh rendering effect.
In a further embodiment, step two is further: the neural network used for training is a generative adversarial network consisting of a glasses-shaped end-to-end generator and a multi-receptive-field discriminator. The discriminator receives pictures, takes part in the training, and supervises the difference between generated image blocks and the bokeh image blocks in the data set. The generator is an end-to-end convolutional neural network that outputs the bokeh-rendered picture.
The generator is further a two-stage network, and both stages adopt a structure formed of an encoder and a decoder. In the first stage, the training network learns the mapping from the image without a blurring effect to the residual between the input image and the corresponding image with a bokeh effect; in the second stage, refinement produces a realistic bokeh effect.
In the first stage, the number of base channels of the network is 16 and the maximum number of channels is 128. The relation between the input image I, the corresponding output image O with a bokeh effect in the label set, and the residual R is R = I - O, so I - R represents a coarse picture with a bokeh rendering effect. The residual R produced at this stage already carries certain depth information, i.e. no additional depth information is needed as prior knowledge. In the second stage, the number of base channels of the network is 32 and the maximum number of channels is 256; the coarse bokeh-rendered picture generated in the first stage is refined into a realistic bokeh-effect picture.
In the generator structure, the encoder block is formed of convolutional layers with a preset stride, comprises three downsampling layers, and outputs a feature map; the decoder block is formed of a set number of transposed convolutional layers with a preset stride, and receives the feature map transformed by the residual blocks. Each residual block connects conv, ReLU, instance-norm, conv, and ReLU layers in sequence, with an additive connection between its input and output; the transformed feature map is obtained by passing the encoder block's output through 9 residual blocks.
The convolutional layers of the encoder block and the transposed convolutional layers of the decoder block are all activated by ReLU. The output layer of each stage is realized by a convolution with a preset stride followed by tanh, and skip connections between each convolutional layer and its mirrored transposed convolutional layer enhance the details of the output image.
During network training, the instance normalization inside the generator's residual blocks must run under the lightweight TensorFlow Lite framework to carry out the image-to-image translation task, so instance normalization is re-implemented using only operators supported by TensorFlow Lite. That is, the computation is performed over a single channel of a single sample, where instance normalization is expressed as:
y_tijk = (x_tijk - μ_ti) / sqrt(σ_ti² + ε)
where x_tijk denotes the tijk-th element, with k and j indexing the spatial dimensions (height and width), i the feature channel, and t the index of the image in the batch; μ_ti denotes the mean and σ_ti² the variance, both computed per sample and per channel with tf.nn.avg_pool2d over the constant feature-map size of each layer.
The loss function involved in training the network is:
L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv
where L_1 is the mean absolute error, L_SSIM the structural-similarity loss, L_VGG the perceptual loss, and L_adv the adversarial loss;
where H is the height of the image, W its width, and C its number of channels; F(·) is the feature map output by layer 34 of a VGG19 network pre-trained on ImageNet; G(I)_{i,j,k} is the picture generated by the generator; C_{i,j,k} is the corresponding original picture with the bokeh effect; and D(·) is the output of the discriminator.
In a further embodiment, step three is further: the parameters of the training network are optimized through supervised learning over a large picture training set until the loss function converges, yielding a generative model that can produce a bokeh-rendered picture from a single picture.
A bokeh-effect rendering system based on a generative adversarial network works as follows: first, the trained GAN model for bokeh rendering is saved and converted into a tflite file; second, the tflite file is deployed to a mobile phone; third, a picture without a background-blurring effect is input on the deployed phone; then the phone's GPU (graphics processing unit) is invoked; and finally, the bokeh-rendered picture is obtained through the neural network. The system specifically comprises the following modules:
a first module for obtaining the bokeh rendering model based on the generative adversarial network;
a second module for deploying the GAN-based bokeh rendering model to the mobile phone, which saves the model obtained by the first module, converts it into a tflite file deployable on a mobile phone, and deploys that file to the user's phone;
a third module for invoking the phone's GPU, which, after model deployment is complete, accepts a picture to be bokeh-rendered and, upon receiving the picture, invokes the phone's GPU to start the bokeh rendering of the picture;
and a fourth module for obtaining and presenting the bokeh-rendered picture, which visually presents the picture with the bokeh rendering effect generated by the third module.
Advantageous effects: the invention provides a bokeh-effect rendering method based on a generative adversarial network and a system implementing the method. A lightweight network is designed, and instance normalization is re-implemented using operators supported by the TensorFlow Lite framework, so that all operators of the GAN used to train the bokeh rendering model (consisting of a glasses-shaped end-to-end generator and a multi-receptive-field discriminator) can be computed on a smartphone GPU without occupying large resources. The invention also achieves the goals of clearly detecting the area to be focused without depending on prior methods, making the blurring of the out-of-focus area natural, and handling many kinds of scenes rather than only specific ones such as portraits.
Detailed Description
The applicant believes that although synthetic bokeh rendering on smartphones has been realized by relying on special or expensive hardware, such approaches are not suitable for the low-end smartphone market.
To solve the problems in the prior art and allow bokeh-effect rendering to be deployed on low-end smartphone devices, the invention provides a bokeh rendering method based on a generative adversarial network and a system implementing the method.
The present invention will be further described in detail with reference to the following examples and accompanying drawings.
In the present application, we propose a method for bokeh rendering based on a generative adversarial network and a system implementing it; the method specifically includes the following steps:
step one, obtaining a picture for training; the method comprises the steps of adopting a data set with rich scene paired pictures, wherein each picture without blurring effect corresponds to a picture with a shot effect in a picture training set.
Step two, feeding the obtained pictures into a neural network for training. The neural network used for training is a generative adversarial network consisting of a glasses-shaped end-to-end generator and a multi-receptive-field discriminator. The discriminator receives pictures, takes part in the training, and supervises the difference between generated image blocks and the bokeh image blocks in the data set. The generator is an end-to-end convolutional neural network that outputs the bokeh-rendered picture.
The discriminator used by the invention serves as a strategy for generating more realistic bokeh images and adopts multi-receptive-field supervision; its schematic diagram is shown in fig. 3. The discriminator supervises the difference between generated 70×70 image blocks and the bokeh image blocks in the corresponding data set. At the same time, the depth of the PatchGAN discriminator is reconsidered and modified during the design so that the network attends to details of image blocks at the same position but of different sizes. The network combines PatchGAN discriminators of different depths into a multi-receptive-field discriminator within the adversarial part of the structure. Multi-receptive-field supervision helps the generator produce results that better match human visual perception.
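The 70×70 block size follows from the receptive field of the PatchGAN layer stack, which can be computed by walking the layers backwards; combining stacks of different depths yields the different receptive-field sizes mentioned above. The exact layer list below (4×4 convolutions, three of stride 2 and two of stride 1, as in the standard 70×70 PatchGAN) is an assumption for illustration, since the text does not enumerate the layers:

```python
def receptive_field(layers):
    """Receptive field of a conv stack; layers = [(kernel, stride), ...].

    Walk backwards from a single output pixel: one pixel at layer l covers
    rf*stride + (kernel - stride) pixels at layer l-1.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = rf * stride + (kernel - stride)
    return rf

# Standard 70x70 PatchGAN: three stride-2 and two stride-1 4x4 convolutions.
patchgan_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
```

Dropping layers shrinks the patch: for example, one stride-2 and two stride-1 4×4 convolutions give a 16×16 receptive field, which is how discriminators of different depths see blocks of different sizes at the same position.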
The invention uses fewer generator parameters and a smaller model, so it runs easily on portable devices such as smartphones. As shown in fig. 2, the generator is a two-stage network, and both stages adopt a structure composed of an encoder and a decoder. In the first stage, the training network learns the mapping from the image without a blurring effect to the residual between the input image and the corresponding image with a bokeh effect; in the second stage, refinement produces a realistic bokeh effect.
In the first stage, the number of base channels of the network is 16 and the maximum number of channels is 128. The relation between the input image I, the corresponding output image O with a bokeh effect in the label set, and the residual R is R = I - O, so I - R represents a coarse picture with a bokeh rendering effect. The residual R produced at this stage already carries certain depth information, i.e. no additional depth information is needed as prior knowledge. In the second stage, the number of base channels of the network is 32 and the maximum number of channels is 256; the coarse picture I - R with the bokeh rendering effect generated in the first stage is refined into a realistic bokeh-effect picture.
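The residual relation can be checked with a toy example: if the first stage predicted the residual perfectly (R = I - O), the coarse output I - R would recover the bokeh label O exactly; in practice the prediction is only approximate, which is why the second refinement stage follows. A minimal NumPy illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.random((4, 4, 3))   # input image without blurring effect
O = rng.random((4, 4, 3))   # label image with bokeh effect

R = I - O                   # residual the first stage learns to predict
coarse = I - R              # coarse bokeh-rendered picture
assert np.allclose(coarse, O)   # a perfect residual recovers the label

# An imperfect prediction leaves a small error for the second stage to refine.
R_hat = R + 0.01 * rng.standard_normal(R.shape)
coarse_hat = I - R_hat      # close to O, but not exact
```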
In the generator structure, the encoder block is formed of convolutional layers with stride 2, comprises three downsampling layers, and outputs a feature map; the decoder block is formed of three transposed convolutional layers with stride 2, and receives the feature map transformed by the residual blocks. Each residual block connects conv, ReLU, instance-norm, conv, and ReLU layers in sequence, with an additive connection between its input and output; the transformed feature map is obtained by passing the encoder block's output through 9 residual blocks.
The convolutional layers of the encoder block and the transposed convolutional layers of the decoder block are all activated by ReLU. The output layer of each stage is realized by a stride-1 convolution followed by tanh, and skip connections between each convolutional layer and its mirrored transposed convolutional layer enhance the details of the output image.
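The feature-map sizes implied by this structure (three stride-2 downsampling layers, shape-preserving residual blocks, three mirrored stride-2 transposed convolutions) can be traced with simple arithmetic. The assumption that channels double at each downsampling layer, from the stated base of 16 up to the cap of 128 for the first stage, is illustrative; the patent states only the base and maximum channel counts:

```python
def generator_shapes(h, w, base=16, max_ch=128, n_down=3):
    """Trace (H, W, C) through encoder -> residual bottleneck -> decoder.

    Assumes channels double at each stride-2 downsampling layer, capped at
    max_ch, and mirror back down in the decoder (an illustrative assumption).
    """
    shapes = []
    c = base
    for _ in range(n_down):            # encoder: stride-2 convolutions
        h, w, c = h // 2, w // 2, min(c * 2, max_ch)
        shapes.append((h, w, c))
    shapes.append((h, w, c))           # 9 residual blocks keep the shape
    for _ in range(n_down):            # decoder: stride-2 transposed convs
        h, w, c = h * 2, w * 2, max(c // 2, base)
        shapes.append((h, w, c))
    return shapes
```

For a 256×256 input, the bottleneck works on a 32×32×128 feature map and the decoder restores the full 256×256 resolution, which also shows why each encoder layer has a mirrored decoder partner for the skip connections.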
Instance normalization is present in the generator's residual blocks during training; if it were removed, the generated images could not achieve the natural, realistic bokeh effect of a network that keeps it. At the same time, to deploy the model to the phone more conveniently, the invention adopts the lightweight TensorFlow Lite framework.
Under TensorFlow Lite, the instance normalization in the generator's residual blocks carries out the image-to-image translation task, and the computation is performed over a single channel of a single sample, where instance normalization is expressed as:
y_tijk = (x_tijk - μ_ti) / sqrt(σ_ti² + ε)
where x_tijk denotes the tijk-th element, with k and j indexing the spatial dimensions (height and width), i the feature channel, and t the index of the image in the batch; μ_ti denotes the mean and σ_ti² the variance, both computed per sample and per channel with tf.nn.avg_pool2d over the constant feature-map size of each layer.
The TensorFlow Lite framework does not support GPU acceleration of instance normalization on the phone, so using the built-in instance normalization adds extra memory overhead from CPU-to-GPU synchronization and greatly increases the image-processing time. To solve this problem, instance normalization is re-implemented using operators supported by TensorFlow Lite, so that all operations of the final model can run on the smartphone's GPU. Application testing with the re-implemented instance normalization shows that the model built by the invention runs nearly 6 times faster on the phone.
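The re-implementation trick can be illustrated in NumPy: the per-sample, per-channel mean and variance are obtained with plain average pooling over the full spatial extent (the role tf.nn.avg_pool2d plays in the TensorFlow Lite-compatible version) plus elementwise operations, with no dedicated normalization operator. A sketch of the idea, not the invention's exact TensorFlow code:

```python
import numpy as np

def avg_pool_full(x):
    # Average-pool over the entire spatial extent: (N, H, W, C) -> (N, 1, 1, C).
    # Mimics tf.nn.avg_pool2d with a window equal to the feature-map size.
    return x.mean(axis=(1, 2), keepdims=True)

def instance_norm_via_pooling(x, eps=1e-5):
    # Instance normalization built only from pooling + elementwise ops,
    # i.e. operators a lightweight mobile runtime can execute on the GPU.
    mu = avg_pool_full(x)               # per-sample, per-channel mean
    var = avg_pool_full((x - mu) ** 2)  # per-sample, per-channel variance
    return (x - mu) / np.sqrt(var + eps)
```

Because every step is a pooling or elementwise operation, the whole normalization stays inside the set of GPU-delegable operators, avoiding the CPU-GPU synchronization described above.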
The loss function involved in training the network is:
L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv
where L_1 is the mean absolute error, L_SSIM the structural-similarity loss, L_VGG the perceptual loss, and L_adv the adversarial loss. L_VGG feeds the input and the output into a VGG19 model pre-trained on the ImageNet data set and computes the mean absolute error between the feature maps taken from layer 34 of VGG19. L_adv is the adversarial loss, which improves the generator's final output through the adversarial game between the generator and the discriminator.
Here F(·) is the feature map output by layer 34 of the VGG19 network pre-trained on ImageNet; G(I)_{i,j,k} is the picture generated by the generator; C_{i,j,k} is the corresponding original picture with the bokeh effect; and D(·) is the output of the discriminator.
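The weighting of the total loss can be sketched numerically. The component losses below are simplified stand-ins (plain mean absolute errors on pixels and on a mock feature map, with the SSIM and adversarial terms supplied as precomputed numbers), since a real SSIM and VGG19 are outside the scope of this sketch:

```python
import numpy as np

def mae(a, b):
    # Mean absolute error between two arrays (the L_1 term).
    return float(np.mean(np.abs(a - b)))

def total_loss(gen, ref, l_ssim, l_adv, feat=lambda x: x[::2, ::2]):
    """L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv.

    l_ssim and l_adv are supplied precomputed; `feat` is a stand-in for the
    layer-34 VGG19 feature extractor (here: simple downsampling).
    """
    l1 = mae(gen, ref)                   # pixel-level mean absolute error
    l_vgg = mae(feat(gen), feat(ref))    # MAE in the mock feature space
    return 0.5 * l1 + 0.05 * l_ssim + 0.1 * l_vgg + l_adv
```

The weights show the design choice: pixel fidelity (0.5) dominates, structural and perceptual terms act as smaller regularizers, and the adversarial term enters unweighted.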
Step three, obtaining the trained generative adversarial network model for bokeh rendering. After supervised learning over a large picture training set, the network parameters are optimized until the loss function converges, yielding a network model that can produce a bokeh-rendered picture from a single input picture.
Based on the method, a system implementing it can be built. The system first saves the trained GAN model for bokeh rendering and converts it into a tflite file; second, it deploys the tflite file to a mobile phone; third, a picture without a background-blurring effect is input on the deployed phone; then the phone's GPU (graphics processing unit) is invoked; and finally, the bokeh-rendered picture is obtained through the neural network. The system specifically comprises the following modules:
The first module obtains the GAN-based bokeh rendering model by extracting the model trained in the previous stage.
The second module deploys the GAN-based bokeh rendering model to the mobile phone: it saves the model obtained by the first module, converts it into a tflite file deployable on a mobile phone, and deploys that file to the user's phone.
The third module mainly issues a GPU operation instruction after receiving an input picture: specifically, the phone's input end receives a picture to be bokeh-rendered, a GPU operation instruction is triggered upon receipt, and the phone's GPU then starts running and begins the bokeh rendering of the picture.
The fourth module obtains the bokeh-rendered picture generated by the third module and presents it on the phone's visual interface.
As shown in fig. 4, comparing the output of the invention's bokeh rendering method with the algorithms proposed by Dutta and by PyNET, the bokeh image generated by the invention is clearly the most natural: objects in the in-focus area are clearly visible and foreground and background are well separated, while the other two results are somewhat blurred and fail to separate foreground from background. Fig. 4 shows, from left to right, the input picture, the result of Dutta's algorithm, the result of PyNET, and the result of the invention. Since the PyNET and Dutta methods rely heavily on the generated MegaDepth map, both produce poor results once the depth map fails to provide accurate depth information.
The invention can be applied on a mobile phone and achieves real-time bokeh rendering there. With the GAN-based bokeh-effect rendering system, the phone no longer depends on an expensive camera module, a multi-camera system, or a pre-estimated depth map: an ordinary photo taken by the camera is enough for the phone to finish the processing, highlighting the subject of the photo and blurring its background. Fig. 5 shows the results of processing photos taken by several phones with the invention; the left side shows the original photo taken by the phone, and the right side the photo processed by the algorithm with a bokeh effect. The method achieves a natural bokeh effect while keeping the subject sharp.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.