CN114694081A - Video sample generation method based on multivariate attribute synthesis - Google Patents


Info

Publication number: CN114694081A
Application number: CN202210423708.2A
Authority: CN (China)
Prior art keywords: attribute, video, frame, dimensional, network
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 孙正兴, 骆守桐, 孙蕴瀚, 徐烨超
Current Assignee: Nanjing University
Original Assignee: Nanjing University
Application filed by Nanjing University
Priority to CN202210423708.2A
Publication of CN114694081A

Classifications

    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (under G06F 17/10 Complex mathematical operations; G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions; G06F Electric digital data processing; G06 Computing; Calculating or counting; G Physics)
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation; G06F 18/20 Analysing; G06F 18/00 Pattern recognition)
    • G06N 3/045 — Combinations of networks (under G06N 3/04 Architecture, e.g. interconnection topology; G06N 3/02 Neural networks; G06N 3/00 Computing arrangements based on biological models; G06N Computing arrangements based on specific computational models)
    • G06N 3/08 — Learning methods (under G06N 3/02 Neural networks; G06N 3/00 Computing arrangements based on biological models)


Abstract

The invention discloses a video sample generation method based on multivariate attribute synthesis, which comprises the following steps. Constructing a multivariate attribute model: first, the multivariate static attributes and dynamic attributes of the foreground object are decomposed frame by frame from a video with a pre-trained video encoder and an attribute decomposition network; the static and dynamic attributes are then subjected to consistency and smoothness processing according to consistency and smoothness constraints; finally, the multivariate attribute model is generated through vector splicing. Generating a multivariate attribute embedding space: an auto-encoder is constructed on the basis of a neural network, trained, and used to generate the multivariate attribute embedding space. Multivariate attribute synthesis: a sample is drawn from the multivariate attribute embedding space, the multivariate attributes of each frame are generated with the trained decoder, the three-dimensional geometry and texture of the target object in each frame are computed with a shape synthesis method and an illumination synthesis method, and a video sample is finally rendered. Repeating the multivariate attribute synthesis process yields as many video samples as the user specifies.

Description

Video sample generation method based on multivariate attribute synthesis
Technical Field
The invention relates to a video sample generation method, and in particular to a video sample generation method based on multivariate attribute synthesis.
Background
With the rapid development of the information age, video data is growing explosively, and demand for video-based downstream applications such as video object detection and video prediction keeps increasing. Thanks to the development of artificial intelligence, AI technologies represented by deep learning have achieved great success in a variety of applications, a success that cannot be separated from the support of large numbers of training samples. However, despite the large amount of video data on the Internet, collecting a wide variety of video samples remains a difficult task.
Although methods for synthesizing large amounts of video data have emerged in recent years, for example Yang C, Wang Z, Zhu X, et al. Pose guided human video generation [C]// Proceedings of the European Conference on Computer Vision (ECCV). 2018: 201-216, and Dorkenwald M, Milbich T, Blattmann A, et al. Stochastic image-to-video synthesis using cINNs [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3742-3753, these methods aim at improving the quality of the synthesized video and cannot meet the diversity requirements of video samples.
The data diversity required of video samples is mainly reflected in the variety of geometry, texture, illumination, pose and deformation of the target in the video. Most of the existing literature focuses on the diversity problem in image synthesis. Document 1 (Z. Yu and C. Zhang, "Image based static facial expression recognition with multiple deep network learning," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, 2015, pp. 435-442) addresses direction and scale diversity through traditional augmentation such as cropping and flipping. Document 2 (Zhou H, Liu J, Liu Z, et al. Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 5911-5920) first fits a three-dimensional model to a single image, then rotates and renders the three-dimensional model to obtain an initially rotated face image, and finally obtains a photorealistic rotated face image through a GAN framework, thereby addressing pose diversity. Document 3 (I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680) addresses deformation diversity by means of GANs. Document 4 (M. Shin, M. Kim, and D.-S. Kwon, "Baseline CNN structure analysis for facial expression recognition," in Robot and Human Interactive Communication (RO-MAN), 2016 25th IEEE International Symposium on. IEEE, 2016, pp. 724-729) decomposes a face image into high-frequency and low-frequency components by Fourier transform and reconstructs a new face image by inverse Fourier transform; however, this method only improves image contrast and cannot solve the illumination diversity problem. Document 5 (Feng Li. Face illumination regularization method based on the SFS algorithm [D]. 2005) uses the SFS algorithm to decompose a face image into albedo, normal and illumination components, replaces the illumination component with a frontal light source, and then reconstructs the face image to achieve illumination regularization. Document 6 (Wang C, Wang C, Xu C, et al. Tag disentangled generative adversarial networks for object image re-rendering [C]. 2017) decomposes a face image into a series of labels such as identity, pose and illumination, and manipulates the illumination label to generate a new face image; this method has some ability to generate pictures with diverse lighting. The above methods are mainly directed to the expansion of image samples. In practice, video applications such as video surveillance and live webcasting are often more demanding than single-image applications, so an expansion scheme for video data is needed. Existing video data expansion schemes, such as Patent 1 (Chen Li, Gao et al., Chinese patent 10743444.7, 2019-12-24), simply fuse existing data sets and make no special design for diversity. On the basis of the above methods, the invention provides an expansion method for video data that meets the diversity requirement.
Disclosure of Invention
Purpose of the invention: the invention aims to overcome the defects of the prior art by providing a video sample generation method based on multivariate attribute synthesis.
In order to solve the technical problem, the invention discloses a video sample generation method based on multivariate attribute synthesis, which comprises the following steps:
step 1, constructing a multi-attribute model, wherein the method comprises the following steps: giving an initial video data set, preprocessing each video in the data set, and decomposing each video into a plurality of static attributes and dynamic attributes of a foreground object frame by frame according to a pre-trained video encoder and an attribute decomposition network; respectively carrying out consistency and smoothness processing on the multivariate static attribute and the dynamic attribute according to consistency constraint and smoothness constraint; generating a multivariate attribute model through vector splicing;
step 2, generating a multi-attribute embedding space, wherein the method comprises the following steps: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in the step 1 to a low-dimensional embedding space through an encoder in a self-coding network; a decoder in the self-coding network is restored into a multi-attribute model, and the self-coding network is trained through constrained reconstruction loss; generating a multi-attribute embedding space by recording the numerical value range of the low-dimensional embedding space;
step 3, synthesizing the multi-element attribute, wherein the method comprises the following steps: sampling from the multi-element attribute embedded space, and generating the multi-element attribute of each frame by using the decoder trained in the step 2; calculating the three-dimensional geometry and the texture of each frame of target object by using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the process of synthesizing the multi-attribute to obtain a plurality of appointed video samples, and finally finishing the generation of the video samples synthesized based on the multi-attribute.
In the invention, the step 1 comprises the following steps:
step 1-1, video preprocessing;
step 1-2, decomposing video attributes;
step 1-3, preprocessing attributes;
1-4, splicing multiple attribute vectors;
and 1-5, repeating the steps 1-1 to 1-3 for each video in the given initial video data set until each video in the video data set is processed, and completing the construction of the multivariate attribute model.
In the invention, the video preprocessing method in step 1-1 comprises the following steps:
utilizing a pre-trained target detector to carry out frame-by-frame target detection on the video, cutting out the target in each frame according to the detected bounding box, and finally scaling each cropped target to a size of 224 × 224 to obtain the processed video frame sequence I_i, i = 1, …, T, where i denotes the i-th frame and T denotes the total number of frames in the video.
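As a rough illustration of this preprocessing step, the following Python sketch assumes a generic detector callable detect_largest_box and the use of OpenCV for cropping and resizing; neither is specified by the patent.

```python
import cv2

def preprocess_video(frames, detect_largest_box):
    """Crop the detected foreground target in every frame and resize it to 224x224.

    frames: list of HxWx3 uint8 images (one per video frame).
    detect_largest_box: callable frame -> (x1, y1, x2, y2); a stand-in for the
        pre-trained target detector of step 1-1.
    Returns the processed frame sequence I_1, ..., I_T.
    """
    processed = []
    for frame in frames:
        x1, y1, x2, y2 = detect_largest_box(frame)
        crop = frame[y1:y2, x1:x2]
        processed.append(cv2.resize(crop, (224, 224)))
    return processed
```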
The video attribute decomposition method in step 1-2 of the invention comprises the following steps:
utilizing the pre-trained video encoder to encode the preprocessed video sequence I_i obtained in step 1-1 into frame-by-frame feature vectors f_i; the video encoder comprises a residual network with T shared-parameter copies, whose input is a 224 × 224 three-channel image and whose output is an n-dimensional feature vector;
decomposing the video into frame-by-frame geometry f_i^g, texture f_i^a, pose f_i^p, illumination f_i^l and deformation f_i^m using a pre-trained attribute decomposition network; the attribute decomposition network comprises 5 sub-networks: a geometry estimation network, a texture estimation network, a pose estimation network, an illumination estimation network and a deformation estimation network; the geometry estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for geometry estimation; the texture estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_a-dimensional feature vector for texture estimation; the illumination estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 27-dimensional feature vector for illumination estimation; the pose estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 6-dimensional feature vector for pose estimation; the deformation estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for deformation estimation;
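A minimal PyTorch sketch of the encoder and the five estimation heads described above; the backbone choice (resnet18) and the default values of n, m_g and m_a are illustrative assumptions, since the patent does not fix them.

```python
import torch.nn as nn
import torchvision.models as models

class AttributeDecomposition(nn.Module):
    """Frame-wise video encoder plus five single-layer estimation heads (step 1-2)."""
    def __init__(self, n=512, m_g=80, m_a=80):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, n)  # n-dimensional frame feature
        self.encoder = backbone
        self.geometry = nn.Linear(n, m_g)       # f_i^g
        self.texture = nn.Linear(n, m_a)        # f_i^a
        self.pose = nn.Linear(n, 6)             # f_i^p
        self.illumination = nn.Linear(n, 27)    # f_i^l
        self.deformation = nn.Linear(n, m_g)    # f_i^m

    def forward(self, frames):                  # frames: (T, 3, 224, 224)
        f = self.encoder(frames)                # (T, n) shared-parameter encoding
        return (self.geometry(f), self.texture(f), self.pose(f),
                self.illumination(f), self.deformation(f))
```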
the attribute preprocessing method in the steps 1-3 of the invention comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: solving the consistency vector f by using the following objective functioncon
Figure BDA0003607550030000041
The frame-by-frame geometry f obtained in step 1-1i hAnd texture fi aRespectively input into a static attribute consistency processing method to obtain the geometry after consistency processing
Figure BDA0003607550030000042
And texture
Figure BDA0003607550030000043
The dynamic attribute smoothness processing method comprises the following steps: splicing the T column vectors into an n multiplied by T matrix in the column direction according to the time sequence; splitting the matrix by rows to obtain n vectors f 'of T dimensions'jWhere j is 1, …, n, the following is performed for each vectorProcessing:
f″j=kernel*f′j
where denotes a discrete convolution operation, kernel ═ 0.0545,0.244,0.403,0.244,0.0545]Denotes a one-dimensional convolution kernel, vector f'jPerforming convolution operation by using kernel in a mode of step length being 1 to obtain a smoothed result f ″j(ii) a Smoothing all n-dimensional vectors in the above manner, splicing the vectors back into an n multiplied by T matrix according to the sequence of rows, and splitting the matrix according to columns to obtain T smoothed n-dimensional column vectors; the frame-by-frame attitude f obtained in the step 1-1i pIllumination fi lAnd deformation fi mInputting a dynamic attribute smoothness processing method to obtain a smoothed attitude
Figure BDA0003607550030000044
Illumination of light
Figure BDA0003607550030000045
And deformation of
Figure BDA0003607550030000046
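A sketch of the two preprocessing operations in NumPy. The smoothing uses the stated kernel at stride 1; the boundary handling (mode="same") and the computation of f_con as the per-dimension mean over frames, i.e. the least-squares minimizer of the objective above, are assumptions about details not reproduced in the source.

```python
import numpy as np

KERNEL = np.array([0.0545, 0.244, 0.403, 0.244, 0.0545])

def consistency(vectors):
    """Static-attribute consistency: collapse T n-dimensional vectors into one f_con.
    Implemented as the per-dimension mean (assumed least-squares solution)."""
    return np.mean(np.stack(vectors, axis=0), axis=0)

def smooth(vectors):
    """Dynamic-attribute smoothing: convolve every dimension over time with KERNEL."""
    mat = np.stack(vectors, axis=1)                      # n x T matrix, columns in time order
    rows = [np.convolve(row, KERNEL, mode="same") for row in mat]
    smoothed = np.stack(rows, axis=0)                    # back to n x T
    return [smoothed[:, t] for t in range(smoothed.shape[1])]  # T smoothed column vectors
```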
The multivariate attribute vector splicing method in step 1-4 of the invention is as follows: the geometry ĝ, texture â, pose p̂_i, illumination l̂_i and deformation m̂_i obtained in step 1-3 are spliced to construct the multivariate attribute model:

S_i = <ĝ, â, p̂_i, l̂_i, m̂_i>

wherein <·, ·, …, ·> denotes the splicing operation of several vectors and S_i denotes the spliced multivariate attribute vector of the i-th frame.
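The splicing itself is a plain concatenation; stacking the per-frame vectors column-wise also yields the attribute matrix used later by the training losses. A short sketch, with the attribute order assumed to follow the formula above:

```python
import numpy as np

def splice_attributes(g_hat, a_hat, p_hat, l_hat, m_hat):
    """Build S_i = <g_hat, a_hat, p_hat_i, l_hat_i, m_hat_i> for every frame i.

    g_hat, a_hat: consistency-processed static attributes shared by all frames.
    p_hat, l_hat, m_hat: lists of T smoothed per-frame dynamic attributes.
    Returns the list S_1..S_T and the (m_g + m_a + 6 + 27 + m_g) x T matrix S.
    """
    S = [np.concatenate([g_hat, a_hat, p, l, m])
         for p, l, m in zip(p_hat, l_hat, m_hat)]
    return S, np.stack(S, axis=1)
```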
In the invention, the step 2 comprises the following steps:
step 2-1, constructing a self-coding network;
step 2-2, self-coding training;
and 2-3, generating a multi-attribute embedding space.
The method for constructing the self-coding network in the step 2-1 comprises the following steps:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m cell units with shared parameters; the τ-th cell unit receives the cell state C_{τ-1} and hidden state h_{τ-1} of the (τ−1)-th cell unit and the current input x_τ as its inputs, and outputs the cell state C_τ and hidden state h_τ of the current cell. The update rule of the self-coding network is as follows:

f_τ = σ(W_f · <h_{τ-1}, x_τ> + b_f)
p_τ = σ(W_p · <h_{τ-1}, x_τ> + b_p)
C̃_τ = tanh(W_C · <h_{τ-1}, x_τ> + b_C)
C_τ = f_τ * C_{τ-1} + p_τ * C̃_τ
o_τ = σ(W_o · <h_{τ-1}, x_τ> + b_o)
h_τ = o_τ * tanh(C_τ)

where <·, ·> denotes the concatenation of two vectors, σ is the sigmoid function, tanh is the hyperbolic tangent function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) are the weights and biases of four different fully-connected layers; the input of the encoder is the multivariate attribute model, and its output is one t-dimensional feature code f_attr.
The decoder has the same structure as the encoder network; its input is the t-dimensional vector f_attr and its output is a multivariate attribute model. In the update rule, the cell state C_1 and hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the inputs x_τ of the decoder at all times are set to f_attr, where τ = 1, …, m.
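A compact PyTorch sketch of such a sequence auto-encoder: an LSTM encoder compresses the T attribute vectors S_1, …, S_T into one t-dimensional code f_attr, and an LSTM decoder that receives f_attr at every time step reconstructs the sequence. Using nn.LSTM with an output projection instead of hand-written gates is a simplification of the cell equations above, not the patent's exact architecture.

```python
import torch.nn as nn

class AttributeAutoencoder(nn.Module):
    """Sequence auto-encoder for the multivariate attribute model (step 2-1)."""
    def __init__(self, attr_dim, code_dim):
        super().__init__()
        self.encoder = nn.LSTM(attr_dim, code_dim, batch_first=True)
        self.decoder = nn.LSTM(code_dim, code_dim, batch_first=True)
        self.proj = nn.Linear(code_dim, attr_dim)     # map hidden states back to attributes

    def encode(self, S):                    # S: (B, T, attr_dim)
        _, (h, _) = self.encoder(S)         # initial states default to zero vectors
        return h[-1]                        # f_attr: (B, code_dim)

    def decode(self, f_attr, T):
        x = f_attr.unsqueeze(1).expand(-1, T, -1)   # feed f_attr at every time step
        out, _ = self.decoder(x)
        return self.proj(out)               # reconstructed sequence: (B, T, attr_dim)

    def forward(self, S):
        return self.decode(self.encode(S), S.size(1))
```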
The self-coding training method in step 2-2 of the invention is as follows:
train the self-coding network of step 2-1 by stochastic gradient descent, where the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis. The reconstruction loss is computed as:

l_sigrec = Σ_{i=1}^{T} ||S_i − Ŝ_i||²

where S_i denotes the i-th vector of the multivariate attribute model and Ŝ_i denotes the i-th vector of the reconstructed multivariate attribute model;
the smoothing loss calculation method is as follows:
Figure BDA0003607550030000063
the calculation method of the multivariate attribute matrix S comprises the following steps: will rebuild much moreAll vectors in the MetaAttribute model
Figure BDA0003607550030000064
Superimposing a matrix in the column direction, [:]denotes slicing operation, [ m ]g+ma:,:]Indicates that the m < th > matrix is selecteda+mgThe elements of the row to the last row,
Figure BDA0003607550030000065
representing the second derivative in the x-direction of the matrix;
the consistency loss is calculated as follows:
Figure BDA0003607550030000066
finally, the total loss of training for the entire network is as follows:
l=λ1lsigrec2lsmoo3lconsis
wherein λ is123Is a balance factor.
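A sketch of the three loss terms; because the loss formulas above are themselves reconstructed from the surrounding description, the exact normalization (sum versus mean) and the slicing of the static rows are assumptions.

```python
import torch

def total_loss(S, S_hat, m_static, lambdas=(1.0, 1.0, 1.0)):
    """S, S_hat: (T, D) original and reconstructed attribute matrices, one frame per row.
    m_static = m_g + m_a, the number of leading static-attribute dimensions."""
    l_sigrec = ((S - S_hat) ** 2).sum(dim=1).mean()            # reconstruction loss

    dyn = S_hat[:, m_static:]                                  # dynamic-attribute part
    second_diff = dyn[2:] - 2.0 * dyn[1:-1] + dyn[:-2]         # discrete second derivative in time
    l_smooth = (second_diff ** 2).mean()                       # smoothing loss

    static = S_hat[:, :m_static]                               # static-attribute part
    l_consis = ((static - static.mean(dim=0, keepdim=True)) ** 2).mean()  # consistency loss

    l1, l2, l3 = lambdas
    return l1 * l_sigrec + l2 * l_smooth + l3 * l_consis
```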
The method for generating the multivariate attribute embedding space in step 2-3 is as follows:
generate the feature vectors f_attr for all multivariate attribute models with the encoder of step 2-1, compute the range of each coding dimension and store it in two vectors r_l and r_h, where each dimension of r_l records the minimum value of the feature vectors in that dimension and each dimension of r_h records the maximum value of the feature vectors in that dimension. The multivariate attribute embedding space consists of all vectors r satisfying:

{ r | r_{l,α} ≤ r_α ≤ r_{h,α}, α = 1, …, t }

where r_{l,α} denotes the value of the α-th dimension of r_l, r_α denotes the value of the α-th dimension of the vector r, and r_{h,α} denotes the value of the α-th dimension of r_h.
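A sketch of recording and sampling the embedding space: the per-dimension minimum and maximum of the training codes define an axis-aligned box, and new codes are drawn from inside it. Uniform sampling is an assumption; the patent only requires the sample to stay within the recorded range.

```python
import numpy as np

def build_embedding_space(codes):
    """codes: (N, t) array of f_attr vectors produced by the trained encoder.
    Returns the per-dimension bounds r_l and r_h."""
    return codes.min(axis=0), codes.max(axis=0)

def sample_embedding(r_l, r_h, rng=None):
    """Draw one vector r with r_l[a] <= r[a] <= r_h[a] for every dimension a."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.uniform(r_l, r_h)
```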
In the present invention, step 3 comprises:
step 3-1, sampling a feature vector from the multi-element attribute embedding space, and decoding the feature vector into a multi-element attribute model according to the decoder obtained by training in step 2-2
{Ŝ_i | i = 1, …, T}.
Step 3-2, generating the dynamic and static attributes: the static attribute part and the dynamic attribute part are separated from each decoded multivariate attribute vector and further split into geometry ĝ_i, texture â_i, illumination l̂_i, pose p̂_i and deformation m̂_i; the representative static attributes, namely the geometry ĝ and the texture â, are then generated.
Step 3-3, synthesizing a geometric shape: tth frameThree-dimensional grid model
The three-dimensional mesh G_t^mesh of the t-th frame is computed as follows:

G_t^mesh = G_avg + G_ne · ĝ + G_mor · m̂_t

wherein G_avg is the average geometric shape defined in the three-dimensional deformable model (3DMM); its dimension is 3n, n denotes the number of mesh vertices, and every 3 dimensions represent the x, y, z coordinates of one three-dimensional vertex; G_ne is the neutral shape basis defined in the 3DMM, with the same dimension as G_avg, where every 3 dimensions represent the offset of one vertex's coordinates; G_mor is the shape deformation basis defined in the 3DMM, with the same dimension and meaning as G_ne; the resulting G_t^mesh is a 3n-dimensional vector, which is reshaped into an n × 3 matrix in which each row represents the coordinates of one vertex of the model's three-dimensional mesh;
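A sketch of the shape synthesis; treating G_ne and G_mor as 3n × m_g basis matrices taken from a 3DMM (for example a Basel-style face model) is an assumption about how the bases are stored, and the linear-combination form follows the reconstruction of the formula above.

```python
import numpy as np

def synthesize_geometry(G_avg, G_ne, G_mor, g_hat, m_hat_t):
    """G_avg: (3n,) mean shape; G_ne: (3n, m_g) neutral shape basis;
    G_mor: (3n, m_g) deformation basis; g_hat: (m_g,) geometry code;
    m_hat_t: (m_g,) deformation code of frame t.
    Returns the frame-t mesh vertices as an (n, 3) array."""
    G_mesh = G_avg + G_ne @ g_hat + G_mor @ m_hat_t
    return G_mesh.reshape(-1, 3)
```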
step 3-4, texture synthesis illumination: firstly, synthesizing textures, and restoring a calculation formula of per-vertex color values as follows:
T^mesh = T_avg + T_ne · â

wherein T_avg is the per-vertex texture of the average geometry defined by the 3DMM, with dimension 3n, where every 3 dimensions represent the R, G, B values of one three-dimensional vertex; T_ne is the texture basis defined in the 3DMM, with the same dimension as T_avg, where every 3 dimensions represent the offset of one vertex's R, G, B values; the resulting T^mesh is a 3n-dimensional vector, which is reshaped into an n × 3 matrix in which each row represents the R, G, B values of one vertex of the model's three-dimensional mesh.
Illumination synthesis: the illumination of the t-th frame is synthesized as:

T̂_i^mesh = T_i^mesh · Σ_j l̂_{t,j} Y(j)

wherein T̂_i^mesh denotes the color value of the i-th vertex of the synthesized texture, l̂_{t,j} denotes the value of the j-th dimension of the illumination of the t-th frame, and Y(j) denotes the j-th spherical harmonic basis function; applying this operation to every vertex yields the per-vertex texture colors T̂^mesh after illumination synthesis;
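A sketch of texture synthesis followed by spherical-harmonics shading. Interpreting the 27-dimensional illumination code as 9 second-order SH coefficients per colour channel, and evaluating the SH basis at per-vertex normals of the synthesized mesh, are assumptions consistent with, but not spelled out in, the text above.

```python
import numpy as np

def sh_basis(normals):
    """Evaluate the 9 second-order spherical-harmonic basis functions at unit normals.
    normals: (n, 3) -> (n, 9)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    one = np.ones_like(x)
    return np.stack([
        0.282095 * one,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x ** 2 - y ** 2),
    ], axis=1)

def synthesize_texture_and_lighting(T_avg, T_ne, a_hat, l_hat_t, normals):
    """T_avg: (3n,) mean per-vertex albedo; T_ne: (3n, m_a) texture basis;
    a_hat: (m_a,) texture code; l_hat_t: (27,) illumination code of frame t,
    reshaped here as 9 SH coefficients per RGB channel; normals: (n, 3)."""
    albedo = (T_avg + T_ne @ a_hat).reshape(-1, 3)        # per-vertex R, G, B
    shading = sh_basis(normals) @ l_hat_t.reshape(9, 3)   # per-vertex, per-channel shading
    return albedo * shading                               # lit per-vertex texture
```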
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometric shape
G_t^mesh, the illuminated per-vertex texture T̂^mesh and the pose p̂_t are fed into a renderer to draw the t-th frame image I_t and the t-th frame mask image M_t; an image I_b of the same size as I_t is randomly cropped from another background picture, and the final video frame is synthesized as:

I = M_t ⊙ I_t + (1 − M_t) ⊙ I_b

wherein ⊙ denotes element-wise multiplication;
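A sketch of the final compositing step; the renderer itself (mesh, texture and pose in, image and mask out) is abstracted as a callable, since the patent does not fix a particular renderer, and a single-channel mask is assumed.

```python
import numpy as np

def composite_frame(render, mesh, lit_texture, pose, background):
    """render: callable (mesh, texture, pose) -> (I_t, M_t); a stand-in for the
    renderer of step 3-5. background: HxWx3 crop I_b of the same size as I_t.
    Returns the blended frame M_t * I_t + (1 - M_t) * I_b."""
    I_t, M_t = render(mesh, lit_texture, pose)       # rendered frame and its HxW mask
    M = M_t.astype(np.float32)[..., None]            # broadcast the mask over channels
    return (M * I_t + (1.0 - M) * background).astype(np.uint8)
```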
step 3-6, writing the generated T frame image into a video stream to obtain a final video sample;
and 3-7, repeating the steps 3-1 to 3-6 until the set number is reached, and finishing the generation of the video sample based on the multivariate attribute synthesis.
Beneficial effects:
A multivariate attribute model is built from the initial video data set to generate a multivariate attribute embedding space, and new samples are generated by multivariate attribute synthesis, which effectively expands the scale of the initial video data set and increases its diversity.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic process flow diagram of the present invention.
Fig. 2 is a schematic diagram of rendered pictures of several feature vectors manually picked after random sampling from the multivariate attribute embedding space.
Fig. 3 is a schematic diagram of a picture after illumination synthesis.
Fig. 4 is a mask diagram.
Fig. 5 is a schematic diagram of the generated picture.
FIG. 6 is a schematic diagram of another set of generated pictures.
Detailed Description
A video sample generation method based on multivariate attribute synthesis comprises the following steps:
step 1, constructing a multi-attribute model, wherein the method comprises the following steps: giving an initial video data set, preprocessing each video in the data set, and decomposing each video into a plurality of static attributes and dynamic attributes of a foreground object frame by frame according to a pre-trained video encoder and an attribute decomposition network; respectively carrying out consistency and smoothness processing on the multivariate static attribute and the dynamic attribute according to consistency constraint and smoothness constraint; generating a multivariate attribute model through vector splicing;
the method comprises the following steps:
step 1-1, video preprocessing, the method comprising:
detection using pre-trained targetsThe video is subjected to frame-by-frame target detection by the video detector, targets in the video are cut out according to surrounding frames obtained by detection, and finally each cut target is scaled to the size of 224 multiplied by 224 to obtain a processed video frame sequence IiI is 1, …, T, i represents the ith frame, and T represents the total number of frames in the video.
Step 1-2, decomposing video attributes; the method comprises the following steps:
utilizing a pre-trained video encoder to pre-process the video sequence I obtained in the step 1-1iEncoding into frame-by-frame feature vectors fi(ii) a The video encoder comprises a residual network of T shared parameters, whose input is a 224 x 224 three-channel image, and whose output is an n-dimensional feature vector;
decomposing video into frame-by-frame geometry f by using pre-trained attribute decomposition networki gTexture fi aPosture fi pIllumination fi lAnd deformation fi m(ii) a The attribute decomposition network comprises 5 sub-networks, including: a geometric estimation network, a texture estimation network, an attitude estimation network, an illumination estimation network and a deformation estimation network; wherein the geometric estimation network is a single-layer fully-connected layer, the input of the geometric estimation network is n-dimensional feature vectors, and the output is mgFeature vectors of dimensions for geometric estimation; the texture estimation network is a single-layer fully-connected layer, the input of the texture estimation network is n-dimensional feature vectors, and the output of the texture estimation network is maA feature vector of dimensions for texture estimation; the illumination estimation network is a single-layer full-connection layer, the input of the illumination estimation network is an n-dimensional characteristic vector, and the output of the illumination estimation network is a 27-dimensional characteristic vector for illumination estimation; the attitude estimation network is a single-layer full-connection layer, the input of the attitude estimation network is an n-dimensional characteristic vector, and the output of the attitude estimation network is a 6-dimensional characteristic vector for attitude estimation; the deformation estimation network is a single-layer full-connection layer, the input of the deformation estimation network is n-dimensional characteristic vectors, and the output of the deformation estimation network is mgFeature vectors of dimensions for deformation estimation;
step 1-3, preprocessing attributes; the method comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: consider T n-dimensional vectors { fiIf i ═ 1, …, T }, which expresses the same object, then the T n-dimensional vectors should be kept as consistent as possible, and the consistency vector f is solved using the following objective functioncon
Figure BDA0003607550030000091
The frame-by-frame geometry f obtained in step 1-1i gAnd texture fi aRespectively input into a static attribute consistency processing method to obtain geometry after consistency processing
Figure BDA0003607550030000092
And texture
Figure BDA0003607550030000093
The dynamic attribute smoothness processing method comprises the following steps: consider T n-dimensional column vectors { fiIf i 1, …, T expresses the state of the same object at T consecutive times, then the T n-dimensional vectors should change as smoothly as possible. Splicing T column vectors into an n multiplied by T matrix in the column direction according to the time sequence; the matrix is divided according to lines to obtain n vectors f 'of T dimension'jWhere j is 1, …, n, each vector is processed as follows:
f″j=kernel*f′j
where denotes a discrete convolution operation, kernel ═ 0.0545,0.244,0.403,0.244,0.0545]Denotes a one-dimensional convolution kernel, vector f'jPerforming convolution operation by using kernel in a mode of step length being 1 to obtain a smoothed result f ″j(ii) a Smoothing all n-dimensional vectors in the above manner, splicing the vectors back into an n multiplied by T matrix according to the sequence of rows, and splitting the matrix according to columns to obtain T smoothed n-dimensional column vectors; the frame-by-frame attitude f obtained in the step 1-1i pLight irradiation fi lAnd deformation fi mInputting a dynamic attribute smoothness processing method to obtain a smoothed attitude
Figure BDA0003607550030000101
Illumination of light
Figure BDA0003607550030000102
And deformation of
Figure BDA0003607550030000103
Step 1-4, multivariate attribute vector splicing; the method is as follows: the geometry ĝ, texture â, pose p̂_i, illumination l̂_i and deformation m̂_i obtained in step 1-3 are spliced to construct the multivariate attribute model:

S_i = <ĝ, â, p̂_i, l̂_i, m̂_i>

wherein <·, ·, …, ·> denotes the splicing operation of several vectors and S_i denotes the spliced multivariate attribute vector of the i-th frame.
And 1-5, repeating the steps 1-1 to 1-3 for each video in the given initial video data set until each video in the video data set is processed, and completing the construction of the multivariate attribute model.
Step 2, generating a multi-attribute embedding space, wherein the method comprises the following steps: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in the step 1 to a low-dimensional embedding space through an encoder in a self-coding network; a decoder in the self-coding network is restored into a multi-element attribute model, and the self-coding network is trained through constrained reconstruction loss; generating a multi-attribute embedding space by recording the numerical value range of the low-dimensional embedding space; the method comprises the following steps:
step 2-1, constructing a self-coding network; the method comprises the following steps:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m cell units with shared parameters; the τ-th cell unit receives the cell state C_{τ-1} and hidden state h_{τ-1} of the (τ−1)-th cell unit and the current input x_τ as its inputs, and outputs the cell state C_τ and hidden state h_τ of the current cell. The update rule of the self-coding network is as follows:

f_τ = σ(W_f · <h_{τ-1}, x_τ> + b_f)
p_τ = σ(W_p · <h_{τ-1}, x_τ> + b_p)
C̃_τ = tanh(W_C · <h_{τ-1}, x_τ> + b_C)
C_τ = f_τ * C_{τ-1} + p_τ * C̃_τ
o_τ = σ(W_o · <h_{τ-1}, x_τ> + b_o)
h_τ = o_τ * tanh(C_τ)

where <·, ·> denotes the concatenation of two vectors, σ is the sigmoid function, tanh is the hyperbolic tangent function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) are the weights and biases of four different fully-connected layers; the input of the encoder is the multivariate attribute model, and its output is one t-dimensional feature code f_attr.
The decoder has the same structure as the encoder network; its input is the t-dimensional vector f_attr and its output is a multivariate attribute model. In the update rule, the cell state C_1 and hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the inputs x_τ of the decoder at all times are set to f_attr, where τ = 1, …, m.
Step 2-2, self-coding training; the method comprises the following steps:
training the self-coding network of step 2-1 by stochastic gradient descent, where the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis; the reconstruction loss is computed as:
l_sigrec = Σ_{i=1}^{T} ||S_i − Ŝ_i||²

where S_i denotes the i-th vector of the multivariate attribute model and Ŝ_i denotes the i-th vector of the reconstructed multivariate attribute model.
The smoothing loss is computed as:

l_smooth = || ∂²(Ŝ[m_g+m_a:, :]) / ∂x² ||²

where the multivariate attribute matrix Ŝ is obtained by stacking all vectors Ŝ_i of the reconstructed multivariate attribute model in the column direction, [:, :] denotes the slicing operation, [m_g+m_a:, :] selects the elements from row m_g+m_a to the last row, and ∂²/∂x² denotes the second derivative of the matrix in the x (time) direction.
The consistency loss is computed as:

l_consis = Σ_{i=1}^{T} || Ŝ_i[:m_g+m_a] − (1/T) Σ_{k=1}^{T} Ŝ_k[:m_g+m_a] ||²

Finally, the total training loss of the whole network is:

l = λ_1 l_sigrec + λ_2 l_smooth + λ_3 l_consis

where λ_1, λ_2, λ_3 are balance factors.
And 2-3, generating a multi-attribute embedding space. The method comprises the following steps:
feature vector generation using the encoder in step 2-1 for all multivariate attribute models
(denoted f_attr); the coding range of each dimension is calculated and stored in two vectors r_l and r_h, where each dimension of r_l records the minimum value of the feature vectors in that dimension and each dimension of r_h records the maximum value of the feature vectors in that dimension; the multivariate attribute embedding space consists of all vectors r satisfying { r | r_{l,α} ≤ r_α ≤ r_{h,α}, α = 1, …, t }, where r_{l,α}, r_α and r_{h,α} denote the value of the α-th dimension of r_l, r and r_h, respectively.
Step 3, synthesizing the multi-element attribute, wherein the method comprises the following steps: sampling from the multi-element attribute embedded space, and generating the multi-element attribute of each frame by using the decoder trained in the step 2; calculating the three-dimensional geometry and the texture of each frame of target object by using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the multi-attribute synthesis process to obtain a plurality of specified video samples, and finally finishing the generation of the video samples synthesized based on the multi-attribute.
Step 3-1, sampling a feature vector from the multi-element attribute embedding space, and decoding the feature vector into a multi-element attribute model according to the decoder obtained by training in step 2-2
{Ŝ_i | i = 1, …, T}.
Step 3-2, generating the dynamic and static attributes: the static attribute part and the dynamic attribute part are separated from each decoded multivariate attribute vector and further split into geometry ĝ_i, texture â_i, illumination l̂_i, pose p̂_i and deformation m̂_i; the representative static attributes, namely the geometry ĝ and the texture â, are then generated.
Step 3-3, synthesizing a geometric shape: t frame three-dimensional grid model
G_t^mesh is computed as follows:

G_t^mesh = G_avg + G_ne · ĝ + G_mor · m̂_t

wherein G_avg is the average geometric shape defined in the three-dimensional deformable model (3DMM); its dimension is 3n, n denotes the number of mesh vertices, and every 3 dimensions represent the x, y, z coordinates of one three-dimensional vertex; G_ne is the neutral shape basis defined in the 3DMM, with the same dimension as G_avg, where every 3 dimensions represent the offset of one vertex's coordinates; G_mor is the shape deformation basis defined in the 3DMM, with the same dimension and meaning as G_ne; the resulting G_t^mesh is a 3n-dimensional vector, which is reshaped into an n × 3 matrix in which each row represents the coordinates of one vertex of the model's three-dimensional mesh;
step 3-4, synthesizing texture illumination: firstly, synthesizing textures, and restoring a calculation formula of per-vertex color values as follows:
T^mesh = T_avg + T_ne · â

wherein T_avg is the per-vertex texture of the average geometry defined by the 3DMM, with dimension 3n, where every 3 dimensions represent the R, G, B values of one three-dimensional vertex; T_ne is the texture basis defined in the 3DMM, with the same dimension as T_avg, where every 3 dimensions represent the offset of one vertex's R, G, B values; the resulting T^mesh is a 3n-dimensional vector, which is reshaped into an n × 3 matrix in which each row represents the R, G, B values of one vertex of the model's three-dimensional mesh.
Illumination synthesis: the illumination of the t-th frame is synthesized as:

T̂_i^mesh = T_i^mesh · Σ_j l̂_{t,j} Y(j)

wherein T̂_i^mesh denotes the color value of the i-th vertex of the synthesized texture, l̂_{t,j} denotes the value of the j-th dimension of the illumination of the t-th frame, and Y(j) denotes the j-th spherical harmonic basis function; applying this operation to every vertex yields the per-vertex texture colors T̂^mesh after illumination synthesis;
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometric shape
G_t^mesh, the illuminated per-vertex texture T̂^mesh and the pose p̂_t are fed into a renderer to draw the t-th frame image I_t and the t-th frame mask image M_t; an image I_b of the same size as I_t is randomly cropped from another background picture, and the final video frame is synthesized as:

I = M_t ⊙ I_t + (1 − M_t) ⊙ I_b

wherein ⊙ denotes element-wise multiplication;
3-6, writing the generated T frame image into a video stream to obtain a final video sample;
and 3-7, repeating the steps 3-1 to 3-6 until the set number is reached, and finishing the generation of the video sample based on the multivariate attribute synthesis.
Examples
As shown in fig. 1, the method for generating a video sample based on multivariate attribute synthesis disclosed in the present invention is specifically implemented according to the following steps:
step 1, constructing a multivariate attribute model: firstly, decomposing multi-element static attributes (geometry, texture) and dynamic attributes (posture, illumination and deformation) of a foreground object frame by frame from a video according to a pre-trained video encoder and an attribute decomposition network, and then respectively carrying out consistency and smoothness processing on the static attributes and the dynamic attributes according to consistency constraint and smoothness constraint; and finally generating a multi-element attribute model through vector splicing.
Step 2, generating a multi-attribute embedding space: and (3) constructing a self-encoder based on a neural network, mapping the multi-element attribute model constructed in the step (1) to a low-dimensional embedding space through the encoder of the self-encoding network, reducing the multi-element attribute model into the multi-element attribute model through a decoder, and training the self-encoding network through constraint reconstruction loss. And generating the multi-attribute embedding space by recording the value range of the low-dimensional embedding space.
Step 3, multi-element attribute synthesis: sampling from a multi-element attribute embedding space, generating the multi-element attribute of each frame by utilizing a trained decoder, calculating the three-dimensional geometry and the texture of a target object of each frame by utilizing a shape synthesis method and an illumination synthesis method, and finally rendering to generate a video sample. Repeating the multivariate attribute synthesis process may result in multiple video samples specified by the user.
The main flow of each step is described in detail below.
Wherein, step 1 includes the following steps:
step 1.1, decomposing video attributes: video I with pre-trained video encoder i1, …, T is encoded as a frame-by-frame feature vector { fiL 1, …, T, decomposing the frame-by-frame feature vector by using the attribute decomposition network to obtain the frame-by-frame geometry { f i g1, …, T }, texture { f |i a1, …, T, pose { f |i p1, …, T }, illumination { f |i l1, …, T |, and the deformation { f |i m|i,…,T}。
Step 1.2, attribute preprocessing: considering T n-dimensional vectors {f_i | i = 1, …, T} that express the same object, the T vectors should be kept as consistent as possible, so static attribute consistency processing is performed on the frame-by-frame geometry and texture to obtain the consistency-processed geometry
ĝ and texture â. Considering T n-dimensional column vectors {f_i | i = 1, …, T} that express the states of the same object at T consecutive times, the T vectors should change as smoothly as possible, so the frame-by-frame pose, illumination and deformation are smoothed to obtain the smoothed pose p̂_i, illumination l̂_i and deformation m̂_i.
Step 1.3, multi-element attribute vector splicing, for the geometry obtained from step 1.2
ĝ, texture â, pose p̂_i, illumination l̂_i and deformation m̂_i, the multivariate attribute model S_i = <ĝ, â, p̂_i, l̂_i, m̂_i> is obtained by the vector splicing operation.
And 1.4, repeating the steps 1-1 to 1-3 for each video in the original video data set until each video in the video data set is processed.
The step 2 comprises the following steps:
step 2.1, constructing a self-coding network, wherein the encoder comprises m cell units, the m cell units share parameters, and the ith cell unit receives the cell state C of the (i-1) th cell uniti-1Hidden state hi-1And current time input xiAs input, and outputting the cell state C of the current celliAnd a hidden state hiThe decoder has a structure consistent with a dynamic mode coding network, and is different in that the input of the decoder is a t-dimensional vector fattrThe output is a multivariate attribute model. In the update rule, the difference is the cell state C of the encoder at time 11And hidden state h1Set to the full 0 vector, input x at all times of the decoderiI is 1, …, m is set as fattr
And 2.2, training a self-coding network, wherein the self-coding network constructed in the step 2.1 is trained by adopting a back propagation and random gradient descent method, and the reconstruction loss, the smoothing loss and the consistency loss are minimized.
Step 2.3, generating a multi-element attribute embedding space, and generating feature vectors for all multi-element attribute models by using the encoder in the step 2-1
to obtain the feature vectors f_attr. The coding range of each dimension is calculated and stored in the two vectors r_l and r_h, where each dimension of r_l records the minimum value of the feature vectors in that dimension and each dimension of r_h records the maximum value of the feature vectors in that dimension.
Step 3 comprises the following steps
Step 3.1, sampling a feature vector from the multi-element attribute embedding space, and decoding the feature vector into a multi-element attribute model according to the decoder obtained by training in the step 2.2
{Ŝ_i | i = 1, …, T}.
Step 3.2, generating the dynamic and static attributes: the static attribute part and the dynamic attribute part are separated from the decoded multivariate attribute model and further split into geometry ĝ_i, texture â_i, illumination l̂_i, pose p̂_i and deformation m̂_i; the representative static attributes, namely the geometry ĝ and the texture â, are then generated.
And 3.3, synthesizing a geometric shape, and synthesizing the geometric shape and the deformation into a three-dimensional shape according to the 3DMM model.
Step 3.4, texture illumination is synthesized, the texture is synthesized into vertex-by-vertex color values according to the 3DMM model, and the vertex-by-vertex texture colors after the synthesized illumination are obtained according to the spherical harmonic illumination formula
Step 3.5, rendering, namely, synthesizing the t-th frame three-dimensional geometric shape
G_t^mesh; the synthesized geometry, the illuminated per-vertex texture T̂^mesh and the pose p̂_t are fed into the renderer to draw the t-th frame rendered image I_t and the t-th frame mask image M_t; a background image I_b of the same size as I_t is randomly cropped from another background picture, and the rendered image and the background image are blended according to the mask image.
And 3.6, writing the generated T frame image into a video stream to obtain a final video sample.
And 3.7, repeating the steps 3.1 to 3.6 until the number of samples required by the user is met.
In this embodiment, a human face is taken as an example for explanation. The specific implementation process is as follows:
in the step 1, the video data set mainly adopts the video data set in the documents Nagrani A, Chung J S, Zisserman A.Voxceleb: a large-scale marker identification dataset [ J ]. arXIv prediction arXIv:1706.08612,2017. Then step 1.1 is performed for attribute decomposition. The residual network in step 1.1 is constructed in the manner described in the document He K, Zhang X, Ren S, et al. deep residual learning for image registration [ C ]// Proceedings of the IEEE reference on computer vision and pattern registration.2016: 770-. Video coding networks and attribute decomposition networks were pre-trained on the CelebA dataset in the manner described in the documents Deng Y, Yang J, Xu S, et al, accurate 3d face retrieval with well-featured learning From single image to image set [ C ]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition works.2019: 0-0. The multivariate property model is then constructed according to steps 1.2 and 1.3.
In step 2, the self-coding network is trained on the multivariate attribute models obtained in step 1, and the multivariate attribute embedding space is generated. The stochastic gradient descent method described in step 2.2 is carried out in the manner described in Bottou L. Stochastic gradient descent tricks [M]// Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012: 421-436.
To illustrate some intermediate results in step 3, the present invention is first shown by taking the generation of a single frame as an example. Sampling the multi-attribute embedded space generated in the second step, generating a representative static attribute according to the steps 3.1 and 3.2, setting deformation and illumination to be a full 0 vector, obtaining a human face three-dimensional model and vertex-by-vertex color values by using the steps 3.3 and 3.4, rendering the result of multiple sampling by using the rendering step of the step 3.5 to obtain a result shown in the figure 2, then generating illumination according to the steps 3.1 and 3.2, re-synthesizing the vertex-by-vertex color values according to the step 3.4, and obtaining the result shown in the figure 3 by using the rendering of the step 3.5. Fig. 4 is the corresponding mask image. Wherein the white area represents a face area and the black area represents a background area.
Next, the invention shows how to generate multiple video frames. First, the multivariate attribute embedding space generated in step 2 is sampled, and representative static attributes and T dynamic attributes are generated according to steps 3.1 and 3.2; the rendered image of each frame is then obtained from the static attributes and the dynamic attributes of each moment according to steps 3.3, 3.4 and 3.5. Since the number of rendered video frames is large, one frame is taken every 3 frames and the frames are stitched to obtain the image shown in Fig. 5. The multivariate attribute embedding space generated in step 2 is then sampled again, representative static attributes and T dynamic attributes are generated according to steps 3.1 and 3.2, the representative attributes of this second sampling are replaced by the representative attributes obtained from the first sampling, the rendered image of each frame is obtained according to steps 3.3, 3.4 and 3.5, one frame is taken every 3 frames, and the frames are stitched to obtain the image shown in Fig. 6. Fig. 6 shows a new video segment whose frames are sampled at an interval of 3. The identity of the person in Fig. 6 is consistent with that in Fig. 5; the differences are the illumination, deformation and pose.
The 3DMM model used in steps 3.3 and 3.4 is the model described in Paysan P, Knothe R, Amberg B, et al. A 3D face model for pose and illumination invariant face recognition [C]// 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2009: 296-301.
The present invention provides a video sample generation method based on multivariate attribute synthesis. There are many methods and approaches for implementing this technical solution, and the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be realized by the prior art.

Claims (10)

1. A video sample generation method based on multivariate attribute synthesis is characterized by comprising the following steps:
step 1, constructing a multi-attribute model, wherein the method comprises the following steps: giving an initial video data set, preprocessing each video in the data set, and decomposing each video into a plurality of static attributes and dynamic attributes of a foreground object frame by frame according to a pre-trained video encoder and an attribute decomposition network; respectively carrying out consistency and smoothness processing on the multivariate static attribute and the dynamic attribute according to consistency constraint and smoothness constraint; generating a multivariate attribute model through vector splicing;
step 2, generating a multi-attribute embedding space, wherein the method comprises the following steps: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in the step 1 to a low-dimensional embedding space through an encoder in a self-coding network; a decoder in the self-coding network is restored into a multi-attribute model, and the self-coding network is trained through constrained reconstruction loss; generating a multi-attribute embedding space by recording the numerical value range of the low-dimensional embedding space;
step 3, synthesizing the multivariate attribute, wherein the method comprises the following steps: sampling from the multi-element attribute embedded space, and generating the multi-element attribute of each frame by using the decoder trained in the step 2; calculating the three-dimensional geometry and the texture of each frame of target object by using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the process of synthesizing the multi-attribute to obtain a plurality of appointed video samples, and finally finishing the generation of the video samples synthesized based on the multi-attribute.
2. The method for generating a video sample based on multivariate attribute synthesis as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1-1, video preprocessing, the method comprising:
utilizing a pre-trained target detector to carry out frame-by-frame target detection on the video, cutting out the targets in the video according to the detected bounding boxes, and finally scaling each cut target to a size of 224 × 224 to obtain a processed video frame sequence I_i, i = 1, ..., T, where i denotes the ith frame and T denotes the total number of frames in the video;
step 1-2, decomposing video attributes;
step 1-3, preprocessing attributes;
step 1-4, splicing the multivariate attribute vectors;
and step 1-5, repeating steps 1-1 to 1-3 for each video in the given initial video data set until every video in the data set has been processed, thereby completing the construction of the multivariate attribute model.
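The preprocessing of step 1-1 (frame-by-frame detection, cropping and rescaling to 224 × 224) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the detector itself is abstracted away, and `boxes[i]` is assumed to already hold the (x1, y1, x2, y2) bounding box of the target in frame i.

```python
import cv2
import numpy as np

def preprocess_frames(frames, boxes, size=224):
    """Return the processed frame sequence I_i, i = 1..T."""
    processed = []
    for frame, (x1, y1, x2, y2) in zip(frames, boxes):
        crop = frame[y1:y2, x1:x2]                        # cut out the detected target
        processed.append(cv2.resize(crop, (size, size)))  # scale to 224 x 224
    return processed  # list of T images of shape (224, 224, 3)
```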
3. The method according to claim 2, wherein the video attribute decomposition method in step 1-2 comprises:
encoding the preprocessed video sequence I_i obtained in step 1-1 into frame-by-frame feature vectors f_i by using a pre-trained video encoder; the video encoder comprises T parameter-sharing residual networks, each of which takes a 224 × 224 three-channel image as input and outputs an n-dimensional feature vector;
decomposing the video frame by frame into geometry f_i^g, texture f_i^a, pose f_i^p, illumination f_i^l and deformation f_i^m by using a pre-trained attribute decomposition network; the attribute decomposition network comprises 5 sub-networks: a geometry estimation network, a texture estimation network, a pose estimation network, an illumination estimation network and a deformation estimation network; the geometry estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for geometry estimation; the texture estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_a-dimensional feature vector for texture estimation; the illumination estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 27-dimensional feature vector for illumination estimation; the pose estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 6-dimensional feature vector for pose estimation; the deformation estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for deformation estimation.
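The attribute decomposition network of claim 3 (five single fully-connected heads over an n-dimensional frame feature) can be sketched as follows. The concrete sizes n, m_g and m_a are illustrative assumptions; the claim leaves them symbolic.

```python
import torch
import torch.nn as nn

class AttributeDecomposition(nn.Module):
    def __init__(self, n=512, m_g=80, m_a=80):
        super().__init__()
        self.geometry     = nn.Linear(n, m_g)   # m_g-dim geometry code
        self.texture      = nn.Linear(n, m_a)   # m_a-dim texture code
        self.illumination = nn.Linear(n, 27)    # 27-dim illumination code
        self.pose         = nn.Linear(n, 6)     # 6-dim pose code
        self.deformation  = nn.Linear(n, m_g)   # m_g-dim deformation code

    def forward(self, f):                       # f: (T, n) frame features
        return {
            "geometry": self.geometry(f),
            "texture": self.texture(f),
            "illumination": self.illumination(f),
            "pose": self.pose(f),
            "deformation": self.deformation(f),
        }
```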
4. The method for generating a video sample based on multivariate attribute synthesis as claimed in claim 3, wherein the attribute preprocessing method in step 1-3 comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: solving the consistency vector f by using the following objective functioncon
Figure FDA0003607550020000021
The frame-by-frame geometry f obtained in step 1-1i gAnd texture fi aRespectively input into a static attribute consistency processing method to obtain the geometry after consistency processing
Figure FDA0003607550020000022
And texture
Figure FDA0003607550020000023
the dynamic attribute smoothness processing method comprises: splicing the T column vectors into an n × T matrix in the column direction according to the time sequence; splitting the matrix by rows to obtain n T-dimensional vectors f'_j, where j = 1, ..., n; and processing each vector as follows:
f''_j = kernel * f'_j
where * denotes the discrete convolution operation and kernel = [0.0545, 0.244, 0.403, 0.244, 0.0545] is a one-dimensional convolution kernel; the vector f'_j is convolved with kernel with a step size of 1 to obtain the smoothed result f''_j; after all n T-dimensional vectors are smoothed in this way, they are spliced back into an n × T matrix in row order, and the matrix is split by columns to obtain T smoothed n-dimensional column vectors; the frame-by-frame pose f_i^p, illumination f_i^l and deformation f_i^m obtained in step 1-2 are input into the dynamic attribute smoothness processing method to obtain the smoothed pose f̄_i^p, illumination f̄_i^l and deformation f̄_i^m.
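The dynamic attribute smoothness processing of claim 4 amounts to a row-wise one-dimensional convolution with a fixed 5-tap kernel. A minimal sketch, assuming 'same' boundary padding (the claim specifies only a step size of 1, so the boundary handling is an assumption):

```python
import numpy as np

KERNEL = np.array([0.0545, 0.244, 0.403, 0.244, 0.0545])

def smooth_dynamic_attribute(vectors):
    """vectors: list of T n-dimensional column vectors (e.g. per-frame poses)."""
    M = np.stack(vectors, axis=1)                              # n x T matrix
    smoothed = np.stack(
        [np.convolve(row, KERNEL, mode="same") for row in M]   # row-wise convolution
    )
    return [smoothed[:, t] for t in range(smoothed.shape[1])]  # back to T columns
```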
5. The method according to claim 4, wherein the multivariate attribute vector splicing method in steps 1-4 comprises: performing a vector splicing operation on the consistency-processed geometry f̄^g and texture f̄^a and the smoothed pose f̄_i^p, illumination f̄_i^l and deformation f̄_i^m obtained in step 1-3 to construct the multivariate attribute model:
S_i = < f̄^g, f̄^a, f̄_i^p, f̄_i^l, f̄_i^m >
where < ·, ·, ..., · > represents the splicing operation of a plurality of vectors and S_i represents the result of splicing the multivariate attribute vectors of the ith frame.
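The splicing of steps 1-4 is plain vector concatenation; a one-line sketch, with the attribute order taken from the claim text above:

```python
import numpy as np

def splice_attributes(geometry, texture, pose_i, illum_i, deform_i):
    """Return S_i = <geometry, texture, pose_i, illum_i, deform_i>."""
    return np.concatenate([geometry, texture, pose_i, illum_i, deform_i])
```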
6. The method according to claim 5, wherein step 2 comprises:
step 2-1, constructing a self-coding network;
step 2-2, self-coding training;
and 2-3, generating a multi-attribute embedding space.
7. The method according to claim 6, wherein the method for constructing the self-coding network in step 2-1 comprises:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m parameter-sharing cell units; the τth cell unit receives the cell state C_{τ-1} and the hidden state h_{τ-1} of the (τ-1)th cell unit and the input x_τ at the current moment as inputs, and outputs the cell state C_τ and the hidden state h_τ of the current cell; the update rule of the self-coding network is as follows:
f_τ = σ(W_f · <h_{τ-1}, x_τ> + b_f)
p_τ = σ(W_p · <h_{τ-1}, x_τ> + b_p)
C̃_τ = tanh(W_C · <h_{τ-1}, x_τ> + b_C)
C_τ = f_τ * C_{τ-1} + p_τ * C̃_τ
o_τ = σ(W_o · <h_{τ-1}, x_τ> + b_o)
h_τ = o_τ * tanh(C_τ)
where <·,·> represents the splicing operation of two vectors, σ is the sigmoid function, tanh is the hyperbolic tangent function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) respectively represent the weights and biases of four different fully-connected layers; the input of the encoder is the multivariate attribute model, and the output is one t-dimensional feature code f_attr;
the decoder adopts the same cell-unit structure; the input to the decoder is the t-dimensional vector f_attr, and the output is the multivariate attribute model; in the update rule, the cell state C_1 and the hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the input x_τ of the decoder at every time step is set to f_attr, where τ = 1, ..., m.
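The cell unit of claim 7 follows the update rule above, which matches a standard LSTM-style cell built from the four gates (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o). A minimal sketch of one parameter-shared cell; the layer sizes are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class CellUnit(nn.Module):
    def __init__(self, x_dim, h_dim):
        super().__init__()
        cat = x_dim + h_dim
        self.Wf = nn.Linear(cat, h_dim)   # forget gate  (W_f, b_f)
        self.Wp = nn.Linear(cat, h_dim)   # input gate   (W_p, b_p)
        self.Wc = nn.Linear(cat, h_dim)   # candidate    (W_C, b_C)
        self.Wo = nn.Linear(cat, h_dim)   # output gate  (W_o, b_o)

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([h_prev, x_t], dim=-1)           # <h_{tau-1}, x_tau>
        f = torch.sigmoid(self.Wf(z))
        p = torch.sigmoid(self.Wp(z))
        c_tilde = torch.tanh(self.Wc(z))
        c = f * c_prev + p * c_tilde                   # new cell state C_tau
        h = torch.sigmoid(self.Wo(z)) * torch.tanh(c)  # hidden state h_tau
        return h, c
```

In the encoder the cell is unrolled over the m attribute vectors starting from zero states; in the decoder the same cell is unrolled with x_τ fixed to f_attr at every step, as described in the claim.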
8. The method according to claim 7, wherein the self-coding training method in step 2-2 comprises:
training the self-coding network in step 2-1 by a stochastic gradient descent method, wherein the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis;
the reconstruction loss penalizes the discrepancy between each S_i and its reconstruction (its formula is given as an image in the original claim), where S_i represents the ith vector of the multivariate attribute model and Ŝ_i represents the ith vector of the reconstructed multivariate attribute model;
the smoothing loss is computed by the formula given as an image in the original claim; the multivariate attribute matrix S used therein is obtained by stacking all vectors Ŝ_i of the reconstructed multivariate attribute model in the column direction into a matrix, [:, :] denotes the slicing operation, [m_g+m_a :, :] indicates selecting the elements from the (m_a+m_g)th row to the last row, and ∂²/∂x² represents the second derivative of the matrix in the x direction;
the consistency loss is computed by the formula given as an image in the original claim;
finally, the total training loss of the whole network is:
l = λ_1 l_sigrec + λ_2 l_smooth + λ_3 l_consis
where λ_1, λ_2 and λ_3 are balance factors.
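Because the three loss terms appear only as formula images in the original claim, the sketch below uses plausible readings of the surrounding text: a squared reconstruction error, a second-difference penalty along the frame axis for the dynamic rows, and a per-frame deviation penalty for the static rows. It is an assumption-laden illustration, not a transcription of the patented loss.

```python
import torch

def total_loss(S, S_hat, m_g, m_a, lambdas=(1.0, 1.0, 1.0)):
    """S, S_hat: (T, d) original and reconstructed multivariate attribute models.
    All three terms are assumed forms consistent with the claim text."""
    l_sigrec = ((S - S_hat) ** 2).sum(dim=1).mean()            # reconstruction error

    dynamic = S_hat[:, m_g + m_a:]                             # rows after m_g + m_a
    l_smooth = ((dynamic[2:] - 2 * dynamic[1:-1] + dynamic[:-2]) ** 2).mean()

    static = S_hat[:, :m_g + m_a]                              # static (geometry + texture) part
    l_consis = ((static - static.mean(dim=0, keepdim=True)) ** 2).mean()

    l1, l2, l3 = lambdas
    return l1 * l_sigrec + l2 * l_smooth + l3 * l_consis
```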
9. The method according to claim 8, wherein the method for generating the multivariate attribute embedding space in step 2-3 comprises:
generating feature codes for all multivariate attribute models by using the encoder in step 2-1; computing the coding range of each dimension and storing it in two vectors r_l and r_h, where each dimension of r_l records the minimum value of the feature codes in that dimension and each dimension of r_h records the maximum value of the feature codes in that dimension; the multivariate attribute embedding space consists of all vectors r satisfying { r | r_{l,α} ≤ r_α ≤ r_{h,α}, α = 1, ..., t }, where r_{l,α} represents the value of the αth dimension of r_l, r_α represents the value of the αth dimension of r, and r_{h,α} represents the value of the αth dimension of r_h.
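A sketch of building the embedding space of claim 9 and drawing a sample from it. Uniform sampling inside the [r_l, r_h] box is an assumption; the claim only defines which vectors belong to the space.

```python
import numpy as np

def embedding_space(codes):
    """codes: array of shape (N, t), one t-dimensional feature code per model."""
    r_l = codes.min(axis=0)          # per-dimension minimum
    r_h = codes.max(axis=0)          # per-dimension maximum
    return r_l, r_h

def sample_code(r_l, r_h, rng=None):
    """Draw one vector r with r_l[a] <= r[a] <= r_h[a] in every dimension a."""
    rng = rng or np.random.default_rng()
    return rng.uniform(r_l, r_h)     # uniform sampling is an assumption
```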
10. The method according to claim 9, wherein step 3 comprises:
step 3-1, sampling a feature vector from the multi-element attribute embedding space, and decoding the feature vector into a multi-element attribute model according to the decoder obtained by training in step 2-2
Figure FDA0003607550020000051
Figure FDA0003607550020000052
Step 3-2, generating dynamic attributes and static attributes: resolving static attribute part from multi-attribute model obtained by decoding
Figure FDA0003607550020000053
And dynamic Properties section
Figure FDA0003607550020000054
Further splitting it into geometries
Figure FDA0003607550020000055
Texture
Figure FDA0003607550020000056
Illumination of light
Figure FDA0003607550020000057
Posture
Figure FDA0003607550020000058
And deformation of
Figure FDA0003607550020000059
Generating representative static attributes, geometries
Figure FDA00036075500200000510
Texture
Figure FDA00036075500200000511
Step 3-3, synthesizing a geometric shape: t frame three-dimensional grid model
Figure FDA00036075500200000512
The calculation method of (2) is as follows:
Figure FDA00036075500200000513
wherein G isavgThe method is an average geometric shape defined in a three-dimensional deformable model 3DMM, the dimensionality of the average geometric shape is 3n, and n represents the number of grid vertexes; each 3 dimensions represent x, y and z coordinate values of a three-dimensional vertex; gneIs a neutral shape group as defined in 3DMM and has the same dimension as GavgEach 3 dimensions represents the offset of a three-dimensional vertex coordinate value; gmorIs a deformation base of the shape defined in 3DMM, the dimension and meaning are the same as Gne(ii) a G obtained finallymeshThe vector with 3n dimensionality is normalized into an n multiplied by 3 matrix through the heavy shape, wherein each row of the matrix represents a coordinate value of a model three-dimensional grid vertex;
step 3-4, texture synthesis illumination: firstly, synthesizing textures, and restoring a calculation formula of per-vertex color values as follows:
Figure FDA00036075500200000514
wherein, TavgA vertex-by-vertex texture of average geometry defined by 3DMM with dimensions 3n representing the R, G, B values of a three-dimensional vertex every 3 dimensions; t is a unit ofneIs a texture base defined in 3DMM and has the same dimension as TavgEach 3 dimensions represent the offset of R, G and B values of a three-dimensional vertex; the final TmeshIs a vector of dimension 3n, which is normalized by reshaping it into an n x 3 matrix, where each row of the matrix is represented by a weightRepresenting the R, G and B values of the vertexes of a three-dimensional mesh of the model;
and (3) synthesizing illumination: frame t illumination
Figure FDA0003607550020000061
The synthesis method comprises the following steps:
Figure FDA0003607550020000062
wherein the content of the first and second substances,
Figure FDA0003607550020000063
a color value representing the ith vertex of the synthesized texture,
Figure FDA0003607550020000064
a value representing the jth dimension of the illumination of the tth frame, and Y (j) represents the jth base of the spherical harmonic; the operation of each vertex is as above to obtain the color of the synthesized illumination per-vertex texture
Figure FDA0003607550020000065
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometric shape
Figure FDA0003607550020000066
Synthesized illuminated per-vertex texture
Figure FDA0003607550020000067
And attitude
Figure FDA0003607550020000068
Sending into a renderer to draw a t frame image ItAnd mask image M of t-th frametRandomly cropping from another background picture and ItImages I of the same sizebAnd synthesizing the final video frame, wherein the synthesizing method comprises the following steps:
Figure FDA0003607550020000069
wherein |, indicates element-by-element multiplication;
step 3-6, writing the generated T frame image into a video stream to obtain a final video sample;
and 3-7, repeating the steps 3-1 to 3-6 until the set number is reached, and finishing the generation of the video sample based on the multivariate attribute synthesis.
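Steps 3-3 to 3-5 of claim 10 combine linear 3DMM synthesis, spherical-harmonics shading and mask-based compositing. The sketch below follows that flow under stated assumptions: the 3DMM bases (G_avg, G_ne, G_mor, T_avg, T_ne), the per-vertex spherical-harmonics basis values and the renderer outputs are taken as given, and the 3 × 9 layout of the 27-dimensional illumination code is an assumption rather than something fixed by the claim.

```python
import numpy as np

def synthesize_geometry(G_avg, G_ne, G_mor, f_g, f_m_t):
    """G_mesh = G_avg + G_ne @ f_g + G_mor @ f_m_t, reshaped to (n, 3)."""
    g = G_avg + G_ne @ f_g + G_mor @ f_m_t        # 3n-dimensional vector
    return g.reshape(-1, 3)                       # one xyz row per vertex

def synthesize_texture(T_avg, T_ne, f_a):
    """T_mesh = T_avg + T_ne @ f_a, reshaped to (n, 3) RGB rows."""
    return (T_avg + T_ne @ f_a).reshape(-1, 3)

def shade_vertices(T_mesh, f_l_t, sh_basis):
    """Per-vertex SH shading: colour_i * sum_j f_l[j] * Y_j.
    sh_basis: (n, 9) spherical-harmonics basis values per vertex (assumed given).
    f_l_t: 27-dim illumination code, assumed laid out as 9 coefficients per RGB channel."""
    coeffs = f_l_t.reshape(3, 9)
    shading = sh_basis @ coeffs.T                 # (n, 3) shading per vertex
    return T_mesh * shading

def composite(I_t, M_t, I_b):
    """Final frame: rendered pixels where the mask is on, background elsewhere.
    M_t must be broadcastable against I_t (e.g. H x W x 1 with values in [0, 1])."""
    return M_t * I_t + (1.0 - M_t) * I_b
```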
CN202210423708.2A 2022-04-21 2022-04-21 Video sample generation method based on multivariate attribute synthesis Pending CN114694081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210423708.2A CN114694081A (en) 2022-04-21 2022-04-21 Video sample generation method based on multivariate attribute synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210423708.2A CN114694081A (en) 2022-04-21 2022-04-21 Video sample generation method based on multivariate attribute synthesis

Publications (1)

Publication Number Publication Date
CN114694081A true CN114694081A (en) 2022-07-01

Family

ID=82144208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210423708.2A Pending CN114694081A (en) 2022-04-21 2022-04-21 Video sample generation method based on multivariate attribute synthesis

Country Status (1)

Country Link
CN (1) CN114694081A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499396A (en) * 2022-11-16 2022-12-20 北京红棉小冰科技有限公司 Information generation method and device with personality characteristics
CN115499396B (en) * 2022-11-16 2023-04-07 北京红棉小冰科技有限公司 Information generation method and device with personality characteristics
CN116843862A (en) * 2023-08-29 2023-10-03 武汉必盈生物科技有限公司 Three-dimensional thin-wall model grid surface texture synthesis method
CN116843862B (en) * 2023-08-29 2023-11-24 武汉必盈生物科技有限公司 Three-dimensional thin-wall model grid surface texture synthesis method

Similar Documents

Publication Publication Date Title
Lee et al. Context-aware synthesis and placement of object instances
CN109389671B (en) Single-image three-dimensional reconstruction method based on multi-stage neural network
CN110390638B (en) High-resolution three-dimensional voxel model reconstruction method
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN111368662B (en) Method, device, storage medium and equipment for editing attribute of face image
CN114694081A (en) Video sample generation method based on multivariate attribute synthesis
CN116958453B (en) Three-dimensional model reconstruction method, device and medium based on nerve radiation field
Huang et al. Ponder: Point cloud pre-training via neural rendering
Yun et al. Joint face super-resolution and deblurring using generative adversarial network
Chen et al. Domain adaptation for underwater image enhancement via content and style separation
CN111462274A (en) Human body image synthesis method and system based on SMPL model
CN112634438A (en) Single-frame depth image three-dimensional model reconstruction method and device based on countermeasure network
Lei et al. NITES: A non-parametric interpretable texture synthesis method
RU2713695C1 (en) Textured neural avatars
Kouzani et al. Towards invariant face recognition
Zhang et al. DIMNet: Dense implicit function network for 3D human body reconstruction
CN110322548B (en) Three-dimensional grid model generation method based on geometric image parameterization
CN117173445A (en) Hypergraph convolution network and contrast learning multi-view three-dimensional object classification method
CN111311732A (en) 3D human body grid obtaining method and device
CN116452715A (en) Dynamic human hand rendering method, device and storage medium
CN113129347B (en) Self-supervision single-view three-dimensional hairline model reconstruction method and system
Laradji et al. SSR: Semi-supervised Soft Rasterizer for single-view 2D to 3D Reconstruction
De Souza et al. Fundamentals and challenges of generative adversarial networks for image-based applications
Shangguan et al. 3D human pose dataset augmentation using generative adversarial network
Johnston et al. Single View 3D Point Cloud Reconstruction using Novel View Synthesis and Self-Supervised Depth Estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination