CN107968962A - Video generation method based on deep learning for two non-adjacent frames - Google Patents

Video generation method based on deep learning for two non-adjacent frames

Info

Publication number
CN107968962A
Authority
CN
China
Prior art keywords
frames
image
generator
video
non-adjacent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711343243.5A
Other languages
Chinese (zh)
Other versions
CN107968962B (en)
Inventor
温世平
刘威威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201711343243.5A priority Critical patent/CN107968962B/en
Publication of CN107968962A publication Critical patent/CN107968962A/en
Application granted granted Critical
Publication of CN107968962B publication Critical patent/CN107968962B/en
Expired - Fee Related
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video generation method based on deep learning for two non-adjacent frames, belonging to the fields of adversarial learning and video generation. The method comprises: performing linear interpolation on the two non-adjacent frames to obtain N input images; feeding the N input images into a first generator to obtain the N blurry video frames lying between the two non-adjacent frames; feeding the N video frames into a trained second generator to obtain N new, sharp video frames; and concatenating the two non-adjacent frames with the N new video frames to generate a video. A first deep autoencoding convolutional network is built entirely from convolutional layers and trained adversarially to obtain the trained first generator, and a second deep autoencoding convolutional network is built from fully convolutional layers with skip connections and trained adversarially to obtain the trained second generator. The video generated by the present invention is of good quality and long duration.

Description

Video generation method based on deep learning for two non-adjacent frames
Technical field
The invention belongs to the fields of adversarial learning and video generation, and more particularly relates to a video generation method based on deep learning for two non-adjacent frames.
Background art
Video generation and prediction have long been difficult problems in the field of computer vision: traditional, non-deep-learning algorithms struggle to generate continuous, high-quality video. Yet video generation and prediction can be used in many fields, such as behavior analysis, intelligent surveillance, video prediction, and animation production.
The basic theory of deep learning was proposed as early as the 1980s by Yann LeCun et al., but the hardware of the time could not meet its computational requirements, so the development of artificial intelligence was slow. With the improvement of hardware and the rise of deep learning, features learned by convolutional neural networks have been widely adopted in place of hand-engineered features. This approach overcomes the difficulty of designing algorithms by hand as in conventional methods: a neural network is built, its parameters are optimized by algorithms such as gradient descent, and the network is thereby fitted to a very good nonlinear function that takes the place of a manually designed algorithm.
Conventional deep-learning-based video generation methods mainly predict the next frame or the next several frames of a video, or predict motion. Typically, one or more still frames are fed to a network, the subsequent frames serve as the prediction target, and the neural network is trained to map from past frames to future frames; once the network has learned a reasonably good mapping, it can be given several video frames and will output the appearance of future frames. However, the predicted video is often rather blurry, especially when predicting long sequences, and the predictable video length is also very limited: often only a few blurry frames can be predicted. These difficulties severely limit the application of video prediction and generation. In addition, given a target whose next movement is unknown, many motions of this target are possible, so the corresponding video generation result has infinitely many solutions. For us humans, when we see a person in a video smiling, the probability that they will next embrace is high; but a neural network cannot understand such long-range temporal information and contextual information. The second difficulty is that it is hard to produce image sequences of good quality: most generated results are very blurry, longer image sequences can hardly be produced, and only short-term motion analysis and the like can be done, which makes such generation very difficult to apply to animation production and short-video generation.
It can be seen from the above that the prior art suffers from the technical problem that generated or predicted video is of poor quality and short duration.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a video generation method based on deep learning for two non-adjacent frames, thereby solving the technical problem in the prior art that generated or predicted video is of poor quality and short duration.
To achieve the above object, the present invention provides a video generation method based on deep learning for two non-adjacent frames, comprising:
(1) performing linear interpolation on the two non-adjacent frames to obtain N input images, and feeding the N input images into a trained first generator to obtain the N video frames lying between the two non-adjacent frames;
(2) feeding the N video frames into a trained second generator to obtain N new video frames, and concatenating the two non-adjacent frames with the N new video frames to generate a video.
The training of the first generator comprises: building a first deep autoencoding convolutional network entirely from convolutional layers, and training the first deep autoencoding convolutional network adversarially to obtain the trained first generator. The training of the second generator comprises: building a second deep autoencoding convolutional network from fully convolutional layers with skip connections, and training the second deep autoencoding convolutional network adversarially to obtain the trained second generator.
Further, the training of the first generator comprises:
(S1) building a first deep autoencoding convolutional network entirely from convolutional layers, and obtaining from a sample video two non-adjacent sample images and the N real frames lying between the two non-adjacent sample images;
(S2) performing linear interpolation on the two non-adjacent sample images to obtain N sample input images and feeding them into the first deep autoencoding convolutional network, training the first deep autoencoding convolutional network with the objective of minimizing the loss function to obtain N first training images, and feeding the N first training images and the N real frames into a discriminator to obtain a first discrimination result;
(S3) when the first discrimination result is greater than a threshold, repeating step (S2); when the first discrimination result is less than or equal to the threshold, obtaining the trained first generator.
Further, the training of the second generator comprises:
(T1) building a second deep autoencoding convolutional network from fully convolutional layers with skip connections;
(T2) feeding the N first training images into the second deep autoencoding convolutional network, training the second deep autoencoding convolutional network with the objective of minimizing the loss function to obtain N second training images, and feeding the N second training images and the N real frames into the discriminator to obtain a second discrimination result;
(T3) when the second discrimination result is greater than the threshold, repeating step (T2); when the second discrimination result is less than or equal to the threshold, obtaining the trained second generator.
The present invention generates continuous video from two non-adjacent frames, replacing methods that predict the next frame from the previous frame. To improve generation quality, a structure of twin cascaded generators is used. The twin generators have different tasks and different network structures: the first generator is responsible for learning motion features from the interpolated input frames, and the second generator improves image quality on the basis of the first. Cascading the two generators yields high-quality video generation results, and the whole can be trained end to end. A new loss function, the normalized product correlation loss, is designed and used during training to improve the quality of the generated results.
Further, a ReLU nonlinear function is arranged after every convolutional layer in the first deep autoencoding convolutional network and the second deep autoencoding convolutional network.
Further, the discriminator comprises 6 convolutional layers and one fully connected layer, and a normalization operation and a ReLU nonlinear function are arranged in sequence after every convolutional layer.
Further, the loss function is:
Loss = λ1·L_adv + λ2·L_mse + λ3·L_gdl + λ4·L_npcl
where Loss is the loss function; L_adv is the adversarial loss function and λ1 its weight; L_mse is the mean squared error loss function and λ2 its weight; L_gdl is the gradient loss function and λ3 its weight; L_npcl is the normalized product correlation loss function and λ4 its weight.
In general, compared with the prior art, the above technical scheme conceived by the present invention achieves the following beneficial effects:
(1) The present invention uses two non-adjacent frames as the generator input, so that the second frame serves as a constraint on the video generation; this significantly reduces the dimensionality of the solution space and makes generation much easier, while the use of adversarial training is better suited to image generation. In addition, a generation network of two cascaded generators is used: different generators are responsible for different tasks and have different network structures, so the results generated by the two generators are of higher quality and more video frames are generated.
(2) The present invention adopts adversarial training, with the generator and discriminator forming an adversarial network; the combination of an adversarial network with adversarial training is better suited to image generation. Four loss functions are used, namely the adversarial loss function, the mean squared error loss function, the gradient loss function, and the normalized product correlation loss function, which penalize the generated result from different aspects so that it bears a very strong similarity to the real result.
(3) Compared with previous methods, the present invention can generate longer video sequences while ensuring the quality of the generated video. It can be widely applied in the fields of action prediction, video compression, and video generation.
Brief description of the drawings
Fig. 1 is a flowchart of a video generation method based on deep learning for two non-adjacent frames provided by an embodiment of the present invention;
Fig. 2(a) is a first simulation result provided by an embodiment of the present invention;
Fig. 2(b) is a second simulation result provided by an embodiment of the present invention;
Fig. 2(c) is a third simulation result provided by an embodiment of the present invention;
Fig. 2(d) is a fourth simulation result provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely serve to illustrate the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the invention described below may be combined with each other as long as they do not conflict.
As shown in Fig. 1, a video generation method based on deep learning for two non-adjacent frames comprises:
(1) performing linear interpolation on the two non-adjacent frames to obtain N input images, and feeding the N input images into a trained first generator to obtain the N video frames lying between the two non-adjacent frames;
(2) feeding the N video frames into a trained second generator to obtain N new video frames, and concatenating the two non-adjacent frames with the N new video frames to generate a video.
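For concreteness, the two steps above can be sketched as a short PyTorch pipeline. This is an illustrative sketch only: gen1 and gen2 stand for the trained first and second generators (module sketches are given further below), and n = 10 matches the embodiment described later.

```python
import torch

def generate_video(frame_a, frame_b, gen1, gen2, n=10):
    """Generate the n intermediate frames between two non-adjacent frames.

    frame_a, frame_b: tensors of shape (3, H, W) with values in [0, 1].
    gen1, gen2: the trained first and second generators.
    """
    # Step (1): linear interpolation yields n coarse input frames,
    # using n uniformly spaced values of r strictly between 0 and 1.
    rs = torch.linspace(0, 1, n + 2)[1:-1]
    interp = torch.stack([(1 - r) * frame_a + r * frame_b for r in rs])

    with torch.no_grad():
        coarse = gen1(interp)    # blurry intermediate frames
        refined = gen2(coarse)   # step (2): sharpened intermediate frames

    # Concatenate the two given frames with the generated frames into a video.
    return torch.cat([frame_a.unsqueeze(0), refined, frame_b.unsqueeze(0)], dim=0)
```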
The training of the first generator comprises:
(S1) A first deep autoencoding convolutional network is built entirely from convolutional layers, as shown in Table 1. No pooling or normalization layers are used: the network is built from convolutional layers throughout, with a ReLU activation function after each layer to increase the nonlinearity of the network. To avoid the influence of random noise we adopt an autoencoder-style network structure, which on the one hand increases the symmetry of the topology of the generator model and on the other hand also improves the stability of the overall network.
Table 1
The first deep autoencoding convolutional network is as follows:
First layer: convolutional layer, kernel size 5×5, output feature maps 64, stride 1;
Second layer: convolutional layer, kernel size 3×3, output feature maps 128, stride 2;
Third layer: convolutional layer, kernel size 3×3, output feature maps 128, stride 1;
Fourth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 2;
Fifth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Sixth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Seventh layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Eighth layer: convolutional layer, kernel size 3×3, output feature maps 512, stride 1;
Ninth layer: convolutional layer, kernel size 3×3, output feature maps 512, stride 1;
Tenth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Eleventh layer: transposed convolutional layer, kernel size 3×3, output feature maps 256, stride 2;
Twelfth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Thirteenth layer: transposed convolutional layer, kernel size 4×4, output feature maps 64, stride 2;
Fourteenth layer: convolutional layer, kernel size 3×3, output feature maps 3, stride 1.
In the first deep autoencoding convolutional network, multiple convolutional layers are used mainly so that the generator can more accurately learn the motion information of the target in the video, in preparation for the subsequent generation.
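For illustration, the fourteen layers listed above can be written as a PyTorch module as follows. The layer sizes follow the list; the padding and output-padding values are assumptions chosen so that the stride-2 layers halve the spatial resolution and the transposed layers restore it.

```python
import torch.nn as nn

def conv(in_c, out_c, k, s):
    # every convolutional layer is followed by a ReLU, as described above
    return nn.Sequential(nn.Conv2d(in_c, out_c, k, s, padding=k // 2), nn.ReLU())

class Generator1(nn.Module):
    """First deep autoencoding convolutional network (layers as in Table 1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv(3, 64, 5, 1),                       # layer 1
            conv(64, 128, 3, 2),                     # layer 2 (downsample)
            conv(128, 128, 3, 1),                    # layer 3
            conv(128, 256, 3, 2),                    # layer 4 (downsample)
            conv(256, 256, 3, 1),                    # layers 5-7
            conv(256, 256, 3, 1),
            conv(256, 256, 3, 1),
            conv(256, 512, 3, 1),                    # layer 8
            conv(512, 512, 3, 1),                    # layer 9
            conv(512, 256, 3, 1),                    # layer 10
            nn.ConvTranspose2d(256, 256, 3, 2, 1, output_padding=1),  # layer 11 (upsample)
            nn.ReLU(),
            conv(256, 256, 3, 1),                    # layer 12
            nn.ConvTranspose2d(256, 64, 4, 2, 1),    # layer 13 (upsample)
            nn.ReLU(),
            conv(64, 3, 3, 1),                       # layer 14: 3 output channels
        )

    def forward(self, x):
        return self.net(x)
```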
Secondly, since the adversarial training method requires a generator and a discriminator, we built a discriminator network to judge the output of the generator. In the discriminator, each convolutional layer is followed by a batch normalization operation and then a ReLU nonlinear function to strengthen the nonlinearity of the network. Because the discriminator outputs a judgment of real versus fake images, the last layer of the network is a fully connected layer. Its network structure is as follows:
First layer: convolutional layer, kernel size 3×3, output feature maps 128, stride 2;
Second layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Third layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 2;
Fourth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Fifth layer: convolutional layer, kernel size 3×3, output feature maps 128, stride 2;
Sixth layer: convolutional layer, kernel size 3×3, output feature maps 128, stride 1;
Seventh layer: fully connected layer, output neurons 1.
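A corresponding sketch of the discriminator follows. The final Sigmoid is an assumption made so that the output lies in (0, 1) as described later, and LazyLinear is used because the size of the flattened feature map depends on the input resolution, which is not fixed here.

```python
import torch.nn as nn

def d_block(in_c, out_c, s):
    # each discriminator convolution is followed by BatchNorm and ReLU
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, 3, s, padding=1),
        nn.BatchNorm2d(out_c),
        nn.ReLU(),
    )

class Discriminator(nn.Module):
    """Six convolutional layers plus one fully connected layer."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            d_block(3, 128, 2),    # layer 1
            d_block(128, 256, 1),  # layer 2
            d_block(256, 256, 2),  # layer 3
            d_block(256, 256, 1),  # layer 4
            d_block(256, 128, 2),  # layer 5
            d_block(128, 128, 1),  # layer 6
        )
        # LazyLinear infers its input size from the (resolution-dependent)
        # flattened feature map on the first forward pass.
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(1), nn.Sigmoid())

    def forward(self, x):
        return self.fc(self.features(x))  # a score in (0, 1)
```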
Two non-adjacent sample images, together with the N real frames lying between them, are obtained from a sample video;
(S2) Linear interpolation is applied to the two non-adjacent sample images to obtain N sample input images, which are fed into the first deep autoencoding convolutional network; the first deep autoencoding convolutional network is trained with the objective of minimizing the loss function to obtain N first training images, and the N first training images and the N real frames are fed into the discriminator to obtain a first discrimination result;
(S3) When the first discrimination result is greater than a threshold, step (S2) is repeated; when the first discrimination result is less than or equal to the threshold, the trained first generator is obtained.
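Steps (S2)-(S3) amount to a standard adversarial training loop; a simplified sketch follows. The optimizer, learning rates, and the exact form of the stopping test are illustrative assumptions, and combined_loss refers to the combined loss function sketched further below.

```python
import torch

def train_generator1(gen1, disc, loader, combined_loss, threshold, epochs=50):
    """Adversarial training of the first generator, following steps (S1)-(S3).

    loader yields (interp, real): the N interpolated sample input images and
    the N real frames between the two non-adjacent sample images. The Adam
    optimizer and learning rates are illustrative assumptions.
    """
    opt_g = torch.optim.Adam(gen1.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
    bce = torch.nn.BCELoss()

    for _ in range(epochs):
        for interp, real in loader:
            fake = gen1(interp)

            # Train the discriminator to separate real frames from generated ones.
            opt_d.zero_grad()
            d_real, d_fake = disc(real), disc(fake.detach())
            d_loss = bce(d_real, torch.ones_like(d_real)) + \
                     bce(d_fake, torch.zeros_like(d_fake))
            d_loss.backward()
            opt_d.step()

            # Step (S2): train the generator by minimizing the combined loss.
            opt_g.zero_grad()
            g_loss = combined_loss(fake, real, disc(fake))
            g_loss.backward()
            opt_g.step()

        # Step (S3): take the discrimination result as the discriminator's
        # estimate that the generated frames are fake (larger means worse);
        # stop once it falls to the threshold, otherwise repeat (S2).
        if 1.0 - d_fake.mean().item() <= threshold:
            break
    return gen1
```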
The training of the second generator comprises:
(T1) A second deep autoencoding convolutional network is built from fully convolutional layers with skip connections, as shown in Table 2.
Table 2
Unlike the first generator, skip connections are used: the feature maps obtained by earlier convolutional layers are concatenated with the feature maps obtained by later layers, and together they serve as the input of the next convolutional layer. The advantage is that the network can synthesize image features more easily; combined with adversarial training, the output images share more structural information with the real images.
The structure of the second deep autoencoding convolutional network is as follows:
First layer: convolutional layer, kernel size 3×3, output feature maps 128, stride 1;
Second layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Third layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 2;
Fourth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Fifth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 2;
Sixth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Seventh layer: transposed convolutional layer, kernel size 3×3, output feature maps 256, stride 2;
The feature maps obtained by the seventh layer are concatenated with the 256 feature maps obtained by the fourth layer to give 512 feature maps, which serve as the input of the eighth layer.
Eighth layer: convolutional layer, kernel size 3×3, output feature maps 512, stride 1;
Ninth layer: convolutional layer, kernel size 3×3, output feature maps 512, stride 2;
The feature maps obtained by the ninth layer are concatenated with the 256 feature maps obtained by the second layer to give 768 feature maps, which serve as the input of the tenth layer.
Tenth layer: convolutional layer, kernel size 3×3, output feature maps 256, stride 1;
Eleventh layer: convolutional layer, kernel size 3×3, output feature maps 3, stride 1.
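A sketch of this second generator with its two skip connections follows. One assumption is flagged in the comments: for the feature maps to align spatially at the second concatenation, the ninth layer is implemented as a transposed (upsampling) convolution, which the flattened layer list above does not state explicitly.

```python
import torch
import torch.nn as nn

class Generator2(nn.Module):
    """Second deep autoencoding convolutional network with two skip connections."""
    def __init__(self):
        super().__init__()
        def c(i, o, s):   # convolution + ReLU, 3x3 kernel
            return nn.Sequential(nn.Conv2d(i, o, 3, s, 1), nn.ReLU())
        def up(i, o):     # stride-2 transposed convolution + ReLU (upsampling)
            return nn.Sequential(
                nn.ConvTranspose2d(i, o, 3, 2, 1, output_padding=1), nn.ReLU())
        self.l1, self.l2 = c(3, 128, 1), c(128, 256, 1)
        self.l3, self.l4 = c(256, 256, 2), c(256, 256, 1)
        self.l5, self.l6 = c(256, 256, 2), c(256, 256, 1)
        self.l7 = up(256, 256)       # layer 7: upsample back to half resolution
        self.l8 = c(512, 512, 1)     # input: layer 7 output ++ layer 4 output
        self.l9 = up(512, 512)       # assumed transposed so sizes match layer 2
        self.l10 = c(768, 256, 1)    # input: layer 9 output ++ layer 2 output
        self.l11 = nn.Conv2d(256, 3, 3, 1, 1)

    def forward(self, x):
        f2 = self.l2(self.l1(x))
        f4 = self.l4(self.l3(f2))
        f7 = self.l7(self.l6(self.l5(f4)))
        f8 = self.l8(torch.cat([f7, f4], dim=1))              # 256 + 256 = 512 maps
        f10 = self.l10(torch.cat([self.l9(f8), f2], dim=1))   # 512 + 256 = 768 maps
        return self.l11(f10)
```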
(T2) The N first training images are fed into the second deep autoencoding convolutional network; the second deep autoencoding convolutional network is trained with the objective of minimizing the loss function to obtain N second training images, and the N second training images and the N real frames are fed into the discriminator to obtain a second discrimination result;
(T3) When the second discrimination result is greater than the threshold, step (T2) is repeated; when the second discrimination result is less than or equal to the threshold, the trained second generator is obtained.
The adversarial loss function takes the cross-entropy form:
L_adv = E[log D(X)] + E[log(1 − D(G(X̂)))]
where L denotes a loss function and the subscript adv stands for adversarial. The right-hand side is the cross-entropy formula: E denotes expectation, D denotes the discriminator in our method, and G the generator; together G and D form the generative adversarial network. Moreover, since our purpose is video generation, real video frames are needed as reference data to meet the training requirement: X denotes the real video frames (more than two in number), and the missing intermediate part is generated from the two given frames. To keep input and output consistent, frames X̂ equal in number to X are obtained from the two given frames by weighting; the purpose is to let the generator G produce, from X̂, frames similar to X, thereby completing the generation process. Because the deep learning approach is used, G and D are neural networks and can each be represented by a nonlinear function, so the D and G in the formula can both be regarded as functions whose bracketed arguments are the input data, X and X̂ respectively.
A result obtained with the adversarial loss alone merely has a certain similarity to the real image in pixel distribution, but is not necessarily similar in image structure. To ensure similarity in the latter respect, we use the mean squared error loss and the gradient loss to strengthen the similarity between the output result and the real image. These two loss functions take the following forms:
The mean squared error loss is the squared two-norm of the difference of the two inputs Y and X:
L_mse(X, Y) = ||X − Y||₂²
The gradient loss is:
L_gdl(X, Ŷ) = Σ_{i,j} ( | |X_{i,j} − X_{i−1,j}| − |Ŷ_{i,j} − Ŷ_{i−1,j}| |^α + | |X_{i,j−1} − X_{i,j}| − |Ŷ_{i,j−1} − Ŷ_{i,j}| |^α )
In the present invention p and α are both set to 2. X_{i,j} and Ŷ_{i,j} both denote images input to the function; since an image is composed of pixels it can mathematically be regarded as a matrix, and i, j are the matrix indices. This function mainly takes differences between adjacent pixels of an image, takes a norm, and then takes the difference of the norms. Intuitively, the formula above equals 0 when Ŷ is the same as X and is nonzero when they differ. Ŷ is the image we generate, i.e. G(X̂), so we want Ŷ to be as close to X as possible.
With the twin generator network coordinated with the three loss functions above we can already obtain very clean results, but some differences remain in image contrast. We therefore use a further normalized product correlation loss to penalize the contrast of the output result. Its form is as follows:
L_npcl(X, Ŷ) = −log( Σ_{i=1..M} Σ_{j=1..N} X_{i,j}·Ŷ_{i,j} / sqrt( (Σ_{i,j} X_{i,j}²)·(Σ_{i,j} Ŷ_{i,j}²) ) )
where X denotes the input image in matrix form, and M, N denote the numbers of rows and columns of the matrix. The normalized product correlation ranges between 0 and 1, with values closer to 1 indicating more similar images; to turn it into the form of a loss function we take its logarithm and add a minus sign, so that an output closer to 0 indicates greater image correlation, a form that better suits a loss function. After building the neural networks and choosing the loss functions, the next step is to train the networks. After 50 epochs of training, the network already has the ability to generate the missing intermediate frames from two frames, and the generated results are of high quality. The combined loss function is as follows:
Loss = λ1·L_adv + λ2·L_mse + λ3·L_gdl + λ4·L_npcl
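The four terms can be sketched as follows. The weights λ1-λ4 are hyperparameters whose values are not fixed here (the defaults below are placeholders), and the adversarial term is written in its generator-side cross-entropy form.

```python
import torch
import torch.nn.functional as F

def gdl_loss(x, y, alpha=2):
    """Gradient difference loss: compare absolute differences of adjacent pixels."""
    gx_h, gy_h = (x[..., 1:, :] - x[..., :-1, :]).abs(), (y[..., 1:, :] - y[..., :-1, :]).abs()
    gx_w, gy_w = (x[..., :, 1:] - x[..., :, :-1]).abs(), (y[..., :, 1:] - y[..., :, :-1]).abs()
    return ((gx_h - gy_h).abs() ** alpha).mean() + ((gx_w - gy_w).abs() ** alpha).mean()

def npcl_loss(x, y, eps=1e-8):
    """Normalized product correlation loss: -log of the normalized correlation.

    Assumes pixel values in [0, 1], so the correlation lies in (0, 1].
    """
    num = (x * y).sum()
    den = torch.sqrt((x ** 2).sum() * (y ** 2).sum()) + eps
    return -torch.log(num / den + eps)

def combined_loss(fake, real, d_fake, lambdas=(0.05, 1.0, 1.0, 1.0)):
    """Loss = λ1·L_adv + λ2·L_mse + λ3·L_gdl + λ4·L_npcl (weights are illustrative)."""
    l_adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    l_mse = F.mse_loss(fake, real)
    l1, l2, l3, l4 = lambdas
    return l1 * l_adv + l2 * l_mse + l3 * gdl_loss(fake, real) + l4 * npcl_loss(fake, real)
```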
Given two video frames as the input of the deep convolutional generation network of this method, linear interpolation (sampling) is first applied to the two images to obtain ten images, according to the following formula:
(1 − r)·X0 + r·X_{n+1}
where r takes 10 uniformly spaced values between 0 and 1, which yields ten input images. These ten images serve as the input of the first generator, which applies its convolutional layers and outputs ten new images Y′. Y′ and the real images X together serve as the input of discriminator D1, which outputs a discrimination result y1 ∈ (0, 1); y1 represents the discriminator's evaluation of the first generator's output, with larger values indicating worse results, and the generator continually adjusts itself according to y1 to produce better results. The output of the first generator then serves as the input of the second generator, which applies its convolutional layers to obtain a new result Y; Y and the real images X together serve as the input of discriminator D2, which outputs a discrimination result y2 ∈ (0, 1); y2 represents the discriminator's evaluation of the second generator's output, again with larger values indicating worse results, and the generator continually adjusts itself according to y2 to produce better results. The input X is then replaced and this process is repeated until the network has the ability to generate multiple real images from two images. At that point the discriminators are no longer needed; the two generator networks alone complete the generation task. Following the procedure demonstrated in Fig. 2, two frames are fed to the network; after the computation of the two generators, the network generates 10 new video frames, and these 12 frames are concatenated to form a video. Parts of the generation results obtained by this method are shown in Fig. 2(a), Fig. 2(b), Fig. 2(c) and Fig. 2(d). The number of frames to be generated can be controlled; here we choose to generate ten images. Judging from the results, the algorithm studied in the present invention not only generates lifelike, clear and coherent video frames, but can also generate or predict more frames, and can be widely applied in fields such as animation production, video generation, video frame interpolation, and video compression and decompression, with broad application value.
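As a usage illustration of the procedure just described, using the modules sketched above (the checkpoint file names are hypothetical):

```python
import torch

# Build the cascade from the sketched modules and load trained weights.
gen1, gen2 = Generator1().eval(), Generator2().eval()
gen1.load_state_dict(torch.load("gen1.pth"))  # hypothetical checkpoint names
gen2.load_state_dict(torch.load("gen2.pth"))

# Stand-ins for the two non-adjacent frames (3-channel images in [0, 1]).
frame_a, frame_b = torch.rand(3, 128, 128), torch.rand(3, 128, 128)

video = generate_video(frame_a, frame_b, gen1, gen2, n=10)
print(video.shape)  # torch.Size([12, 3, 128, 128]): a 12-frame video
```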
In fact, video generation has a very large solution space, which means it is difficult for a neural network to find a suitable solution in such a huge space; without suitable constraint information it is very difficult to generate coherent video sequences, and the quality of what is generated is also poor. The present invention proposes to use two frames (X1, Xk) separated in time to generate the intermediate motion-process images (X2, ..., Xk−1). We use the image Xk as part of the input to constrain the solution of the video generation: Xk describes the future motion state of the target in X1, so for the generation task Xk is a constraint on the motion generation, and the output of the network can be made as close to Xk as possible. On the other hand, we also use the adversarial network as the training model, which serves as a kind of adversarial constraint, making the samples generated by the adversarial network as similar to the input images as possible. In addition, to solve the second problem, we use adversarial training and a combined loss composed of several different loss functions to guarantee good generation quality, and we use the gray-scale cross-correlation as a new loss function to enhance the sharpness of the generated results. And instead of the conventional generative network with only one generator, we use two generators in series as a cascaded generator: the first generator is mainly used to learn the motion information of the target in the video through adversarial training, without expecting high generation quality, while the second generator improves the quality of the generated video on the basis of the first. Compared with other methods, the video generated by this method is very close to real video, and the length of the generated video far exceeds that of conventional methods.
It will be readily understood by those skilled in the art that the foregoing is merely preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent substitution and improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (6)

  1. A video generation method based on deep learning for two non-adjacent frames, characterized by comprising:
    (1) performing linear interpolation on the two non-adjacent frames to obtain N input images, and feeding the N input images into a trained first generator to obtain the N video frames lying between the two non-adjacent frames;
    (2) feeding the N video frames into a trained second generator to obtain N new video frames, and concatenating the two non-adjacent frames with the N new video frames to generate a video;
    the training of the first generator comprising: building a first deep autoencoding convolutional network entirely from convolutional layers, and training the first deep autoencoding convolutional network adversarially to obtain the trained first generator; the training of the second generator comprising: building a second deep autoencoding convolutional network from fully convolutional layers with skip connections, and training the second deep autoencoding convolutional network adversarially to obtain the trained second generator.
  2. The video generation method based on deep learning for two non-adjacent frames according to claim 1, characterized in that the training of the first generator comprises:
    (S1) building a first deep autoencoding convolutional network entirely from convolutional layers, and obtaining from a sample video two non-adjacent sample images and the N real frames lying between the two non-adjacent sample images;
    (S2) performing linear interpolation on the two non-adjacent sample images to obtain N sample input images and feeding them into the first deep autoencoding convolutional network, training the first deep autoencoding convolutional network with the objective of minimizing the loss function to obtain N first training images, and feeding the N first training images and the N real frames into a discriminator to obtain a first discrimination result;
    (S3) when the first discrimination result is greater than a threshold, repeating step (S2); when the first discrimination result is less than or equal to the threshold, obtaining the trained first generator.
  3. The video generation method based on deep learning for two non-adjacent frames according to claim 2, characterized in that the training of the second generator comprises:
    (T1) building a second deep autoencoding convolutional network from fully convolutional layers with skip connections;
    (T2) feeding the N first training images into the second deep autoencoding convolutional network, training the second deep autoencoding convolutional network with the objective of minimizing the loss function to obtain N second training images, and feeding the N second training images and the N real frames into the discriminator to obtain a second discrimination result;
    (T3) when the second discrimination result is greater than the threshold, repeating step (T2); when the second discrimination result is less than or equal to the threshold, obtaining the trained second generator.
  4. The video generation method based on deep learning for two non-adjacent frames according to any one of claims 1 to 3, characterized in that a ReLU nonlinear function is arranged after every convolutional layer in the first deep autoencoding convolutional network and the second deep autoencoding convolutional network.
  5. The video generation method based on deep learning for two non-adjacent frames according to claim 2 or 3, characterized in that the discriminator comprises 6 convolutional layers and one fully connected layer, and a normalization operation and a ReLU nonlinear function are arranged in sequence after every convolutional layer.
  6. The video generation method based on deep learning for two non-adjacent frames according to claim 2 or 3, characterized in that the loss function is:
    Loss = λ1·L_adv + λ2·L_mse + λ3·L_gdl + λ4·L_npcl
    where Loss is the loss function; L_adv is the adversarial loss function and λ1 its weight; L_mse is the mean squared error loss function and λ2 its weight; L_gdl is the gradient loss function and λ3 its weight; L_npcl is the normalized product correlation loss function and λ4 its weight.
CN201711343243.5A 2017-12-12 2017-12-12 Video generation method based on deep learning for two non-adjacent frames Expired - Fee Related CN107968962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711343243.5A CN107968962B (en) 2017-12-12 2017-12-12 Video generation method based on deep learning for two non-adjacent frames

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711343243.5A CN107968962B (en) 2017-12-12 2017-12-12 Video generation method based on deep learning for two non-adjacent frames

Publications (2)

Publication Number Publication Date
CN107968962A (en) 2018-04-27
CN107968962B (en) 2019-08-09

Family

ID=61994443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711343243.5A Expired - Fee Related 2017-12-12 2017-12-12 Video generation method based on deep learning for two non-adjacent frames

Country Status (1)

Country Link
CN (1) CN107968962B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170169357A1 (en) * 2015-12-15 2017-06-15 Deep Instinct Ltd. Methods and systems for data traffic analysis
CN105354565A (en) * 2015-12-23 2016-02-24 北京市商汤科技开发有限公司 Full convolution network based facial feature positioning and distinguishing method and system
US20170278135A1 (en) * 2016-02-18 2017-09-28 Fitroom, Inc. Image recognition artificial intelligence system for ecommerce
US20170337464A1 (en) * 2016-05-20 2017-11-23 Google Inc. Progressive neural networks
CN106127702A (en) * 2016-06-17 2016-11-16 兰州理工大学 A kind of image mist elimination algorithm based on degree of depth study
CN106296692A (en) * 2016-08-11 2017-01-04 深圳市未来媒体技术研究院 Image significance detection method based on antagonism network
CN106952239A (en) * 2017-03-28 2017-07-14 厦门幻世网络科技有限公司 image generating method and device
CN107220600A (en) * 2017-05-17 2017-09-29 清华大学深圳研究生院 A kind of Picture Generation Method and generation confrontation network based on deep learning
CN107330444A (en) * 2017-05-27 2017-11-07 苏州科技大学 A kind of image autotext mask method based on generation confrontation network
CN107273936A (en) * 2017-07-07 2017-10-20 广东工业大学 A kind of GAN image processing methods and system
CN107463951A (en) * 2017-07-19 2017-12-12 清华大学 A kind of method and device for improving deep learning model robustness

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615073A (en) * 2018-04-28 2018-10-02 北京京东金融科技控股有限公司 Image processing method and device, computer readable storage medium, electronic equipment
CN110473147A (en) * 2018-05-09 2019-11-19 腾讯科技(深圳)有限公司 A kind of video deblurring method and device
CN108665432A (en) * 2018-05-18 2018-10-16 百年金海科技有限公司 A kind of single image to the fog method based on generation confrontation network
CN108805188A (en) * 2018-05-29 2018-11-13 徐州工程学院 A kind of feature based recalibration generates the image classification method of confrontation network
CN108805188B (en) * 2018-05-29 2020-08-21 徐州工程学院 Image classification method for generating countermeasure network based on feature recalibration
CN108875818A (en) * 2018-06-06 2018-11-23 西安交通大学 Based on variation from code machine and confrontation network integration zero sample image classification method
CN109325931A (en) * 2018-08-22 2019-02-12 中北大学 Based on the multi-modality images fusion method for generating confrontation network and super-resolution network
CN110879962B (en) * 2018-09-05 2023-09-22 斯特拉德视觉公司 Method and device for optimizing CNN parameters by utilizing multiple video frames
CN110879962A (en) * 2018-09-05 2020-03-13 斯特拉德视觉公司 Method and apparatus for optimizing CNN parameters using multiple video frames
CN109218629A (en) * 2018-09-14 2019-01-15 三星电子(中国)研发中心 Video generation method, storage medium and device
CN109218629B (en) * 2018-09-14 2021-02-05 三星电子(中国)研发中心 Video generation method, storage medium and device
CN109151575B (en) * 2018-10-16 2021-12-14 Oppo广东移动通信有限公司 Multimedia data processing method and device and computer readable storage medium
CN109151575A (en) * 2018-10-16 2019-01-04 Oppo广东移动通信有限公司 Multimedia data processing method and device, computer readable storage medium
CN109544652A (en) * 2018-10-18 2019-03-29 江苏大学 Add to weigh imaging method based on the nuclear magnetic resonance that depth generates confrontation neural network
CN109492764A (en) * 2018-10-24 2019-03-19 平安科技(深圳)有限公司 Training method, relevant device and the medium of production confrontation network
CN109492764B (en) * 2018-10-24 2024-07-26 平安科技(深圳)有限公司 Training method, related equipment and medium for generating countermeasure network
WO2020082572A1 (en) * 2018-10-24 2020-04-30 平安科技(深圳)有限公司 Training method of generative adversarial network, related device, and medium
CN109360436A (en) * 2018-11-02 2019-02-19 Oppo广东移动通信有限公司 A kind of video generation method, terminal and storage medium
WO2020097795A1 (en) * 2018-11-13 2020-05-22 北京比特大陆科技有限公司 Image processing method, apparatus and device, and storage medium and program product
CN109993820A (en) * 2019-03-29 2019-07-09 合肥工业大学 A kind of animated video automatic generation method and its device
CN109993820B (en) * 2019-03-29 2022-09-13 合肥工业大学 Automatic animation video generation method and device
CN110047118A (en) * 2019-04-08 2019-07-23 腾讯科技(深圳)有限公司 Video generation method, device, computer equipment and storage medium
CN110047118B (en) * 2019-04-08 2023-06-27 腾讯科技(深圳)有限公司 Video generation method, device, computer equipment and storage medium
CN110070612A (en) * 2019-04-25 2019-07-30 东北大学 A kind of CT image layer interpolation method based on generation confrontation network
CN110070612B (en) * 2019-04-25 2023-09-22 东北大学 CT image interlayer interpolation method based on generation countermeasure network
CN110310351B (en) * 2019-07-04 2023-07-21 北京信息科技大学 Sketch-based three-dimensional human skeleton animation automatic generation method
CN110310351A (en) * 2019-07-04 2019-10-08 北京信息科技大学 A kind of 3 D human body skeleton cartoon automatic generation method based on sketch
CN110852970A (en) * 2019-11-08 2020-02-28 南京工程学院 Underwater robot image enhancement method for generating countermeasure network based on depth convolution
CN111476868B (en) * 2020-04-07 2023-06-23 哈尔滨工业大学 Animation generation model training and animation generation method and device based on deep learning
CN111476868A (en) * 2020-04-07 2020-07-31 哈尔滨工业大学 Animation generation model training and animation generation method and device based on deep learning
CN111696049A (en) * 2020-05-07 2020-09-22 中国海洋大学 Deep learning-based underwater distorted image reconstruction method
CN112995433A (en) * 2021-02-08 2021-06-18 北京影谱科技股份有限公司 Time sequence video generation method and device, computing equipment and storage medium
CN113222964A (en) * 2021-05-27 2021-08-06 推想医疗科技股份有限公司 Method and device for generating coronary artery central line extraction model
CN113674185A (en) * 2021-07-29 2021-11-19 昆明理工大学 Weighted average image generation method based on fusion of multiple image generation technologies
CN113674185B (en) * 2021-07-29 2023-12-08 昆明理工大学 Weighted average image generation method based on fusion of multiple image generation technologies

Also Published As

Publication number Publication date
CN107968962B (en) 2019-08-09

Similar Documents

Publication Publication Date Title
CN107968962A (en) A kind of video generation method of the non-conterminous image of two frames based on deep learning
CN110378844A (en) Motion blur method is gone based on the multiple dimensioned Image Blind for generating confrontation network is recycled
CN111105352B (en) Super-resolution image reconstruction method, system, computer equipment and storage medium
CN110097178A (en) It is a kind of paid attention to based on entropy neural network model compression and accelerated method
CN109741247A (en) A kind of portrait-cartoon generation method neural network based
CN102156875A (en) Image super-resolution reconstruction method based on multitask KSVD (K singular value decomposition) dictionary learning
CN108711182A (en) Render processing method, device and mobile terminal device
CN112837224A (en) Super-resolution image reconstruction method based on convolutional neural network
CN109146061A (en) The treating method and apparatus of neural network model
CN115880158B (en) Blind image super-resolution reconstruction method and system based on variation self-coding
CN115471423A (en) Point cloud denoising method based on generation countermeasure network and self-attention mechanism
CN109922346A (en) A kind of convolutional neural networks for the reconstruct of compressed sensing picture signal
CN110533579A (en) Based on the video style conversion method from coding structure and gradient order-preserving
Wu et al. Ganhead: Towards generative animatable neural head avatars
CN109658508B (en) Multi-scale detail fusion terrain synthesis method
CN112380764B (en) Gas scene end-to-end rapid reconstruction method under limited view
Shariff et al. Artificial (or) fake human face generator using generative adversarial network (GAN) machine learning model
CN113096015B (en) Image super-resolution reconstruction method based on progressive perception and ultra-lightweight network
CN110223224A (en) A kind of Image Super-resolution realization algorithm based on information filtering network
CN108769674A (en) A kind of video estimation method based on adaptive stratification motion modeling
CN108470208A (en) It is a kind of based on be originally generated confrontation network model grouping convolution method
CN113129237B (en) Depth image deblurring method based on multi-scale fusion coding network
CN110009568A (en) The generator construction method of language of the Manchus image super-resolution rebuilding
Zhang et al. Research on image super-resolution reconstruction based on deep learning
Zou et al. EDCNN: a novel network for image denoising

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190809

Termination date: 20191212