CN114694081A - Video sample generation method based on multivariate attribute synthesis - Google Patents
Video sample generation method based on multivariate attribute synthesis
- Publication number: CN114694081A (application CN202210423708.2A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- video
- frame
- dimensional
- network
- Prior art date
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Pure & Applied Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video sample generation method based on multivariate attribute synthesis, which comprises the following steps. Constructing a multivariate attribute model: first, the multivariate static attributes and dynamic attributes of the foreground object are decomposed frame by frame from a video according to a pre-trained video encoder and an attribute decomposition network; the static attributes and dynamic attributes are then processed for consistency and smoothness according to consistency and smoothness constraints, respectively; finally, the multivariate attribute model is generated through vector splicing. Generating a multivariate attribute embedding space: a self-encoder based on a neural network is constructed and trained, and the multivariate attribute embedding space is generated. Multivariate attribute synthesis: samples are drawn from the multivariate attribute embedding space, the multivariate attributes of each frame are generated with the trained decoder, the three-dimensional geometry and texture of the target object in each frame are calculated with a shape synthesis method and an illumination synthesis method, and a video sample is finally rendered. Repeating the multivariate attribute synthesis process yields as many video samples as the user specifies.
Description
Technical Field
The invention relates to a video sample generation method, in particular to a video sample generation method based on multi-attribute synthesis.
Background
With the rapid development of the information age, video data is growing explosively, and demand for downstream video applications, such as video target detection and video prediction, is increasing. Artificial intelligence technologies represented by deep learning have achieved great success in various applications, but this success is inseparable from the support of large numbers of training samples. Although there is a large amount of video data on the internet, collecting a wide variety of video samples remains a difficult task.
Methods for synthesizing large amounts of video data have emerged in recent years, for example: Yang C, Wang Z, Zhu X, et al. Pose guided human video generation [C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 201-216; and Dorkenwald M, Milbich T, Blattmann A, et al. Stochastic image-to-video synthesis using cINNs [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3742-3753. However, these methods aim at improving the quality of the synthesized video and cannot meet the diversity requirements of video samples.
The data diversity required of video samples is mainly reflected in the variety of the geometry, texture, illumination, pose and deformation of the target in the video. Most of the current literature focuses on the diversity problem in image synthesis. Document 1 (Z. Yu and C. Zhang, "Image based static facial expression recognition with multiple deep network learning," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 435-442) addresses direction and scale diversity with traditional augmentation such as cropping and flipping. Document 2 (Zhou H, Liu J, Liu Z, et al. Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images [C]. 2020: 5911-5920) first fits a single image with a three-dimensional model, then rotates and renders the three-dimensional model to obtain an initially rotated face image, and finally obtains a realistic rotated face image through a GAN framework, addressing pose diversity. Document 3 (I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680) addresses deformation diversity by means of GANs. Document 4 (M. Shin, M. Kim, and D.-S. Kwon, "Baseline CNN structure analysis for facial expression recognition," in Robot and Human Interactive Communication (RO-MAN), 2016 25th IEEE International Symposium on. IEEE, 2016, pp. 724-729) decomposes a face image into high-frequency and low-frequency components by Fourier transform and reconstructs a new face image by inverse Fourier transform; however, this only improves the contrast of the image and cannot solve the problem of illumination diversity. Document 5 (Feng Li. Face illumination regularization method based on the SFS algorithm [D]. 2005) uses the SFS algorithm to decompose a face image into albedo, normal and illumination components, replaces the illumination component with a frontal light source, and reconstructs the face image to achieve illumination regularization. Document 6 (Wang C, Wang C, Xu C, et al. Tag disentangled generative adversarial networks for object image re-rendering [C]. 2017) decomposes a face image into a series of labels such as identity, pose and illumination, and manipulates the illumination label to generate a new face image; this method has some ability to generate pictures with diverse lighting. The above methods are mainly directed at expanding image samples. In practice, video applications such as video surveillance and live webcasting are often more common than single-image applications, so an expansion scheme for video data is needed. Existing video data expansion schemes, such as Patent 1 (Chen Li, Gao et al. Chinese patent application 10743444.7 [P]. 2019-12-24), simply fuse existing datasets and are not specially designed for diversity. The invention provides an expansion method for video data on the basis of these methods to meet the diversity requirement.
Disclosure of Invention
Purpose of the invention: the invention aims to address the deficiencies of the prior art by providing a video sample generation method based on multivariate attribute synthesis.
In order to solve the above technical problem, the invention discloses a video sample generation method based on multivariate attribute synthesis, which comprises the following steps:

step 1, constructing a multivariate attribute model, wherein the method comprises: giving an initial video data set and preprocessing each video in the data set; decomposing each video frame by frame into multivariate static attributes and dynamic attributes of the foreground object according to a pre-trained video encoder and an attribute decomposition network; performing consistency and smoothness processing on the multivariate static attributes and dynamic attributes according to consistency and smoothness constraints, respectively; and generating the multivariate attribute model through vector splicing;

step 2, generating a multivariate attribute embedding space, wherein the method comprises: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in step 1 to a low-dimensional embedding space through the encoder in the self-coding network; restoring it to a multivariate attribute model through the decoder in the self-coding network, and training the self-coding network through constrained reconstruction losses; and generating the multivariate attribute embedding space by recording the numerical range of the low-dimensional embedding space;
step 3, multivariate attribute synthesis, wherein the method comprises: sampling from the multivariate attribute embedding space, and generating the multivariate attributes of each frame using the decoder trained in step 2; calculating the three-dimensional geometry and texture of the target object in each frame using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the multivariate attribute synthesis process to obtain a specified number of video samples, finally completing the generation of video samples based on multivariate attribute synthesis.
In the invention, the step 1 comprises the following steps:
step 1-1, video preprocessing;
step 1-2, decomposing video attributes;
step 1-3, preprocessing attributes;
step 1-4, multivariate attribute vector splicing;

step 1-5, repeating steps 1-1 to 1-3 for each video in the given initial video data set until every video in the data set has been processed, completing the construction of the multivariate attribute model.
In the invention, the video preprocessing method in the step 1-1 comprises the following steps:
utilizing a pre-trained target detector to perform frame-by-frame target detection on the video, cropping out the targets in the video according to the detected bounding boxes, and finally scaling each cropped target to a size of 224 × 224 to obtain the processed video frame sequence I_i, i = 1, …, T, where i denotes the i-th frame and T denotes the total number of frames in the video.
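As an illustration, the preprocessing of step 1-1 can be sketched in Python as follows; the detector argument is a hypothetical stand-in for the pre-trained target detector, which the text does not name.

```python
import cv2  # OpenCV, used here for cropping and resizing

def preprocess_video(frames, detector):
    """Crop the detected target out of each frame and scale it to 224 x 224.

    frames: list of H x W x 3 images; detector: a stand-in for the
    pre-trained target detector, assumed to return one bounding box
    (x, y, w, h) per frame.
    """
    processed = []
    for frame in frames:
        x, y, w, h = detector(frame)          # hypothetical detector call
        target = frame[y:y + h, x:x + w]      # crop by the bounding box
        processed.append(cv2.resize(target, (224, 224)))
    return processed                          # the sequence I_i, i = 1, ..., T
```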
The video attribute decomposition method in step 1-2 of the invention comprises the following steps:
utilizing a pre-trained video encoder to encode the preprocessed video sequence I_i obtained in step 1-1 into frame-by-frame feature vectors f_i; the video encoder comprises T parameter-sharing copies of a residual network, whose input is a 224 × 224 three-channel image and whose output is an n-dimensional feature vector;

utilizing a pre-trained attribute decomposition network to decompose the video into frame-by-frame geometry f_i^g, texture f_i^a, pose f_i^p, illumination f_i^l and deformation f_i^m; the attribute decomposition network comprises 5 sub-networks: a geometry estimation network, a texture estimation network, a pose estimation network, an illumination estimation network and a deformation estimation network. The geometry estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for geometry estimation; the texture estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_a-dimensional feature vector for texture estimation; the illumination estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 27-dimensional feature vector for illumination estimation; the pose estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 6-dimensional feature vector for pose estimation; the deformation estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for deformation estimation;
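A minimal sketch of the attribute decomposition heads, assuming a PyTorch implementation; the class and argument names are illustrative, not from the original, but the five single-layer fully-connected heads and their output sizes follow the text.

```python
import torch.nn as nn

class AttributeDecompositionNet(nn.Module):
    """Five single fully-connected layers over the n-dim frame feature f_i."""
    def __init__(self, n, m_g, m_a):
        super().__init__()
        self.geometry = nn.Linear(n, m_g)      # f_i^g
        self.texture = nn.Linear(n, m_a)       # f_i^a
        self.pose = nn.Linear(n, 6)            # f_i^p
        self.illumination = nn.Linear(n, 27)   # f_i^l
        self.deformation = nn.Linear(n, m_g)   # f_i^m

    def forward(self, f):                      # f: (T, n) frame features
        return (self.geometry(f), self.texture(f), self.pose(f),
                self.illumination(f), self.deformation(f))
```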
the attribute preprocessing method in the steps 1-3 of the invention comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: solving the consistency vector f by using the following objective functioncon:
The frame-by-frame geometry f obtained in step 1-1i hAnd texture fi aRespectively input into a static attribute consistency processing method to obtain the geometry after consistency processingAnd texture
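Read as a least-squares objective, f_con has the closed-form solution of the per-dimension mean of the frame vectors; a minimal sketch under that assumption:

```python
import numpy as np

def consistency_vector(F):
    """F: (T, n) stack of per-frame static attribute vectors f_i.

    Under the least-squares objective min_f sum_i ||f - f_i||^2 (an
    assumption about the objective above), the minimizer f_con is the
    per-dimension mean over the T frames.
    """
    return F.mean(axis=0)                      # f_con, shape (n,)
```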
The dynamic attribute smoothness processing method comprises: splicing the T column vectors into an n × T matrix in the column direction according to the time sequence, and splitting the matrix by rows to obtain n vectors f'_j of dimension T, where j = 1, …, n; each vector is processed as follows:

f''_j = kernel * f'_j

where * denotes a discrete convolution operation and kernel = [0.0545, 0.244, 0.403, 0.244, 0.0545] denotes a one-dimensional convolution kernel; the vector f'_j is convolved with kernel at a stride of 1 to obtain the smoothed result f''_j. After all n vectors are smoothed in this manner, they are spliced back into an n × T matrix in row order, and the matrix is split by columns to obtain T smoothed n-dimensional column vectors. The frame-by-frame pose f_i^p, illumination f_i^l and deformation f_i^m obtained in step 1-2 are input into the dynamic attribute smoothness processing method to obtain the smoothed pose f̂_i^p, illumination f̂_i^l and deformation f̂_i^m.
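A sketch of the smoothing step with NumPy; 'same'-mode boundary handling is an assumption, since the text specifies only the kernel and the stride.

```python
import numpy as np

KERNEL = np.array([0.0545, 0.244, 0.403, 0.244, 0.0545])  # kernel from the text

def smooth_dynamic_attribute(F):
    """F: (T, n) per-frame dynamic attribute vectors, one row per frame.

    Builds the n x T matrix, convolves each row (one attribute dimension
    over time) with the kernel at stride 1, and returns the T smoothed
    n-dimensional vectors as a (T, n) array.
    """
    M = F.T                                    # n x T matrix, rows over time
    smoothed = np.stack([np.convolve(row, KERNEL, mode='same') for row in M])
    return smoothed.T                          # back to T smoothed vectors
```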
The multivariate attribute vector splicing method in step 1-4 of the invention comprises: performing a vector splicing operation on the geometry f̂^g, texture f̂^a, pose f̂_i^p, illumination f̂_i^l and deformation f̂_i^m obtained in steps 1-2 and 1-3, thereby constructing the multivariate attribute model:

S_i = <f̂^g, f̂^a, f̂_i^p, f̂_i^l, f̂_i^m>

where <·,·,…,·> denotes the splicing operation of a plurality of vectors and S_i denotes the spliced multivariate attribute vector of the i-th frame.
In the invention, the step 2 comprises the following steps:
step 2-1, constructing a self-coding network;
step 2-2, self-coding training;
step 2-3, generating a multivariate attribute embedding space.
The method for constructing the self-coding network in the step 2-1 comprises the following steps:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m cell units with shared parameters; the τ-th cell unit receives the cell state C_(τ-1) and hidden state h_(τ-1) of the (τ-1)-th cell unit and the current input x_τ as inputs, and outputs the cell state C_τ and hidden state h_τ of the current cell. The update rules of the self-coding network are as follows:

f_τ = σ(W_f · <h_(τ-1), x_τ> + b_f)

p_τ = σ(W_p · <h_(τ-1), x_τ> + b_p)

C_τ = f_τ * C_(τ-1) + p_τ * tanh(W_C · <h_(τ-1), x_τ> + b_C)

o_τ = σ(W_o · <h_(τ-1), x_τ> + b_o)

h_τ = o_τ * tanh(C_τ)

where <·,·> denotes the splicing operation of two vectors, σ is the sigmoid function, tanh is the tanh function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) denote the weights and biases of four different fully-connected layers. The input of the encoder is the multivariate attribute model, and the output is one t-dimensional feature code f_attr.

The decoder has a structure consistent with that of the encoder; its input is the t-dimensional vector f_attr and its output is a multivariate attribute model. In the update rules, the cell state C_1 and hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the inputs x_τ of the decoder at all times are set to f_attr, where τ = 1, …, m.
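The update rules above are those of a standard LSTM cell (forget gate f_τ, input gate p_τ, output gate o_τ), so a compact sketch can reuse nn.LSTMCell; taking the final hidden state as the code f_attr is an assumption consistent with the text, and all names here are illustrative.

```python
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    """Sequence autoencoder over the spliced attribute vectors S_1..S_m."""
    def __init__(self, attr_dim, t):
        super().__init__()
        self.enc = nn.LSTMCell(attr_dim, t)        # encoder cell, code size t
        self.dec = nn.LSTMCell(t, attr_dim)        # decoder cell

    def forward(self, S):                          # S: (m, attr_dim)
        m = S.size(0)
        h = torch.zeros(1, self.enc.hidden_size)   # h_1 = 0
        C = torch.zeros(1, self.enc.hidden_size)   # C_1 = 0
        for tau in range(m):                       # feed S_tau step by step
            h, C = self.enc(S[tau:tau + 1], (h, C))
        f_attr = h                                 # t-dimensional feature code
        hd = torch.zeros(1, S.size(1))
        Cd = torch.zeros(1, S.size(1))
        outs = []
        for _ in range(m):                         # x_tau = f_attr at every step
            hd, Cd = self.dec(f_attr, (hd, Cd))
            outs.append(hd)                        # decoder state as reconstruction
        return torch.cat(outs, dim=0), f_attr
```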
The self-coding training method in step 2-2 of the invention comprises the following steps:
training the self-coding network of step 2-1 with a stochastic gradient descent method, where the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis. The reconstruction loss is calculated as:

l_sigrec = Σ_i ||S_i − Ŝ_i||₂²

where S_i denotes the i-th vector of the multivariate attribute model and Ŝ_i denotes the i-th vector of the reconstructed multivariate attribute model.

The smoothing loss is calculated as:

l_smooth = ||∂²(S[m_g+m_a:, :]) / ∂x²||₂²

where the multivariate attribute matrix S is obtained by stacking all vectors Ŝ_i of the reconstructed multivariate attribute model in the column direction, [:] denotes a slicing operation, S[m_g+m_a:, :] selects the elements from the (m_g+m_a)-th row to the last row (the dynamic attributes), and ∂²/∂x² denotes the second derivative of the matrix in the x (time) direction.

The consistency loss l_consis is calculated analogously on the static rows S[:m_g+m_a, :], penalizing their variation across frames.

Finally, the total training loss of the entire network is:

l = λ₁ l_sigrec + λ₂ l_smooth + λ₃ l_consis

where λ₁, λ₂, λ₃ are balance factors.
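A sketch of the three loss terms; the exact formulas are reconstructions from the descriptions above rather than verbatim, and the (frames × dimensions) layout is the transpose of the n × T matrix in the text.

```python
import torch

def total_loss(S, S_hat, m_static, lambdas=(1.0, 1.0, 1.0)):
    """S, S_hat: (T, D) original and reconstructed attribute matrices,
    rows = frames; m_static = m_g + m_a splits static from dynamic dims."""
    l_sigrec = ((S - S_hat) ** 2).sum(dim=1).mean()      # reconstruction loss

    dyn = S_hat[:, m_static:]                            # dynamic attributes
    second_diff = dyn[2:] - 2 * dyn[1:-1] + dyn[:-2]     # discrete 2nd derivative over time
    l_smooth = (second_diff ** 2).mean()                 # smoothing loss

    static = S_hat[:, :m_static]                         # static attributes
    l_consis = ((static - static.mean(dim=0, keepdim=True)) ** 2).mean()

    l1, l2, l3 = lambdas                                 # balance factors
    return l1 * l_sigrec + l2 * l_smooth + l3 * l_consis
```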
The method for generating the multi-attribute embedding space in the step 2-3 comprises the following steps:
feature vectors are generated for all multivariate attribute models using the encoder of step 2-1, and the coding range of each dimension is calculated and stored in two vectors r_l and r_h, where each dimension of r_l records the minimum of the feature vectors in that dimension and each dimension of r_h records the maximum. The multivariate attribute embedding space consists of all vectors r satisfying {r | r_(l,α) ≤ r_α ≤ r_(h,α), α = 1, …, t}, where r_(l,α) denotes the value of the α-th dimension of r_l, r_α the value of the α-th dimension of the vector r, and r_(h,α) the value of the α-th dimension of r_h.
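A minimal sketch of recording the embedding range and drawing from it; uniform sampling is an assumption, since the text only constrains samples to lie between r_l and r_h.

```python
import numpy as np

def embedding_range(codes):
    """codes: (N, t) feature codes f_attr of all multivariate attribute models.
    Returns the per-dimension minimum r_l and maximum r_h."""
    return codes.min(axis=0), codes.max(axis=0)

def sample_embedding(r_l, r_h, rng=None):
    """Draw one vector r with r_l[a] <= r[a] <= r_h[a] in every dimension."""
    rng = rng or np.random.default_rng()
    return rng.uniform(r_l, r_h)
```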
In the present invention, step 3 comprises:
step 3-1, sampling a feature vector from the multivariate attribute embedding space, and decoding it into a multivariate attribute model Ŝ with the decoder trained in step 2-2;

step 3-2, generating dynamic attributes and static attributes: splitting the static attribute part and the dynamic attribute part out of the decoded multivariate attribute model, and further splitting them into geometry f̂^g, texture f̂^a, illumination f̂_t^l, pose f̂_t^p and deformation f̂_t^m; generating the representative static attributes, geometry f̂^g and texture f̂^a;

step 3-3, synthesizing the geometric shape: the three-dimensional mesh model G_mesh^t of the t-th frame is calculated as follows:
G_mesh^t = G_avg + G_ne · f̂^g + G_mor · f̂_t^m

where G_avg is the average geometric shape defined in the three-dimensional deformable model 3DMM; its dimension is 3n, with n denoting the number of mesh vertices, and every 3 dimensions represent the x, y, z coordinate values of one three-dimensional vertex; G_ne is the neutral shape basis defined in 3DMM, with the same dimension as G_avg, where every 3 dimensions represent the offset of one three-dimensional vertex coordinate value; G_mor is the shape deformation basis defined in 3DMM, with the same dimension and meaning as G_ne; the resulting G_mesh^t is a vector of dimension 3n and is reshaped into an n × 3 matrix, in which each row represents the coordinate values of one three-dimensional mesh vertex of the model;
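A sketch of the shape synthesis, treating G_ne and G_mor as 3n × m_g basis matrices acting on the geometry and deformation codes (an assumption consistent with common 3DMM practice).

```python
import numpy as np

def synthesize_shape(G_avg, G_ne, G_mor, f_g, f_m_t):
    """G_avg: (3n,) mean shape; G_ne, G_mor: (3n, m_g) neutral shape and
    deformation bases; f_g, f_m_t: static geometry code and frame-t
    deformation code."""
    g = G_avg + G_ne @ f_g + G_mor @ f_m_t     # 3n-dimensional vertex vector
    return g.reshape(-1, 3)                    # n x 3: one (x, y, z) row per vertex
```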
step 3-4, synthesizing texture and illumination: the texture is synthesized first; the per-vertex color values are restored as follows:
T_mesh = T_avg + T_ne · f̂^a

where T_avg is the per-vertex texture of the average geometry defined by 3DMM; its dimension is 3n, and every 3 dimensions represent the R, G, B values of one three-dimensional vertex; T_ne is the texture basis defined in 3DMM, with the same dimension as T_avg, where every 3 dimensions represent the offsets of the R, G, B values of one three-dimensional vertex; the resulting T_mesh is a vector of dimension 3n and is reshaped into an n × 3 matrix, in which each row represents the R, G, B values of one three-dimensional mesh vertex of the model;
synthesizing illumination: the illumination of the t-th frame is synthesized as follows:

ĉ_i^t = c_i · Σ_j f̂_t^l(j) · Y(j)
where c_i denotes the color value of the i-th vertex of the synthesized texture, f̂_t^l(j) denotes the value of the j-th dimension of the illumination of the t-th frame, and Y(j) denotes the j-th spherical harmonic basis; performing this operation for every vertex yields the per-vertex texture colors ĉ^t with synthesized illumination;
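A sketch of the spherical-harmonic shading for one vertex, reading the 27-dimensional illumination code as 9 coefficients per color channel (an assumption consistent with the 27-dimensional output of the illumination estimation network); the basis constants follow the standard real SH lighting convention.

```python
import numpy as np

def sh_basis(normal):
    """First nine real spherical-harmonic bases Y(j) at a unit normal (x, y, z)."""
    x, y, z = normal
    return np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])

def shade_vertex(color, normal, f_l_t):
    """color: (3,) R, G, B of one synthesized-texture vertex; f_l_t: (27,)
    frame-t illumination code, reshaped into 9 coefficients per channel
    (the per-channel layout is an assumption)."""
    Y = sh_basis(normal)                       # (9,)
    L = f_l_t.reshape(3, 9)                    # per-channel SH coefficients
    return color * (L @ Y)                     # lit R, G, B for this vertex
```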
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometric shapeSynthesized illuminated vertex-by-vertex textureAnd attitudeSending the image into a renderer to draw a t frame image ItAnd mask image M of t-th frametRandomly cropping from another background picture and ItImages I of the same sizebAnd synthesizing the final video frame, wherein the synthesizing method comprises the following steps:
I^t = M_t ⊙ I_t + (1 − M_t) ⊙ I_b

where ⊙ denotes element-by-element multiplication;
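The compositing step for one frame, as a minimal NumPy sketch of the formula above.

```python
import numpy as np

def composite_frame(I_t, M_t, I_b):
    """Blend the rendered frame I_t over the background crop I_b using the
    mask M_t (1 inside the target region, 0 outside), element by element."""
    M = M_t[..., None] if M_t.ndim == 2 else M_t   # broadcast over channels
    return M * I_t + (1.0 - M) * I_b
```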
step 3-6, writing the generated T frame images into a video stream to obtain the final video sample;
step 3-7, repeating steps 3-1 to 3-6 until the set number of samples is reached, completing the generation of video samples based on multivariate attribute synthesis.
Advantageous effects:
a multivariate attribute model is built from the initial video data set to generate a multivariate attribute embedding space, and a new sample is generated in a multivariate attribute synthesis mode, so that the scale of the initial video data set is effectively expanded, and the diversity of the initial video data set is increased.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic process flow diagram of the present invention.
Fig. 2 is a schematic diagram of rendered pictures of several manually selected feature vectors randomly sampled from the multivariate attribute embedding space.
Fig. 3 is a schematic diagram of a picture after illumination synthesis.
Fig. 4 is a mask diagram.
Fig. 5 is a schematic diagram of the generated picture.
FIG. 6 is a schematic diagram of another set of generated pictures.
Detailed Description
A video sample generation method based on multivariate attribute synthesis comprises the following steps:
the method comprises the following steps:
step 1-1, video preprocessing, the method comprising:
utilizing a pre-trained target detector to perform frame-by-frame target detection on the video, cropping out the targets in the video according to the detected bounding boxes, and finally scaling each cropped target to a size of 224 × 224 to obtain the processed video frame sequence I_i, i = 1, …, T, where i denotes the i-th frame and T denotes the total number of frames in the video.
Step 1-2, decomposing video attributes; the method comprises the following steps:
utilizing a pre-trained video encoder to encode the preprocessed video sequence I_i obtained in step 1-1 into frame-by-frame feature vectors f_i; the video encoder comprises T parameter-sharing copies of a residual network, whose input is a 224 × 224 three-channel image and whose output is an n-dimensional feature vector;

utilizing a pre-trained attribute decomposition network to decompose the video into frame-by-frame geometry f_i^g, texture f_i^a, pose f_i^p, illumination f_i^l and deformation f_i^m; the attribute decomposition network comprises 5 sub-networks: a geometry estimation network, a texture estimation network, a pose estimation network, an illumination estimation network and a deformation estimation network. The geometry estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for geometry estimation; the texture estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_a-dimensional feature vector for texture estimation; the illumination estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 27-dimensional feature vector for illumination estimation; the pose estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 6-dimensional feature vector for pose estimation; the deformation estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for deformation estimation;
step 1-3, preprocessing attributes; the method comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: consider T n-dimensional vectors { fiIf i ═ 1, …, T }, which expresses the same object, then the T n-dimensional vectors should be kept as consistent as possible, and the consistency vector f is solved using the following objective functioncon:
The frame-by-frame geometry f obtained in step 1-1i gAnd texture fi aRespectively input into a static attribute consistency processing method to obtain geometry after consistency processingAnd texture
The dynamic attribute smoothness processing method comprises: consider T n-dimensional column vectors {f_i | i = 1, …, T} expressing the state of the same object at T consecutive times; these T vectors should then change as smoothly as possible. The T column vectors are spliced into an n × T matrix in the column direction according to the time sequence, and the matrix is split by rows to obtain n vectors f'_j of dimension T, where j = 1, …, n; each vector is processed as follows:

f''_j = kernel * f'_j

where * denotes a discrete convolution operation and kernel = [0.0545, 0.244, 0.403, 0.244, 0.0545] denotes a one-dimensional convolution kernel; the vector f'_j is convolved with kernel at a stride of 1 to obtain the smoothed result f''_j. After all n vectors are smoothed in this manner, they are spliced back into an n × T matrix in row order, and the matrix is split by columns to obtain T smoothed n-dimensional column vectors. The frame-by-frame pose f_i^p, illumination f_i^l and deformation f_i^m obtained in step 1-2 are input into the dynamic attribute smoothness processing method to obtain the smoothed pose f̂_i^p, illumination f̂_i^l and deformation f̂_i^m.
Step 1-4, multivariate attribute vector splicing; the method comprises: performing a vector splicing operation on the geometry f̂^g, texture f̂^a, pose f̂_i^p, illumination f̂_i^l and deformation f̂_i^m obtained in steps 1-2 and 1-3, thereby constructing the multivariate attribute model:

S_i = <f̂^g, f̂^a, f̂_i^p, f̂_i^l, f̂_i^m>

where <·,·,…,·> denotes the splicing operation of a plurality of vectors and S_i denotes the spliced multivariate attribute vector of the i-th frame.
And 1-5, repeating the steps 1-1 to 1-3 for each video in the given initial video data set until each video in the video data set is processed, and completing the construction of the multivariate attribute model.
Step 2, generating a multivariate attribute embedding space, wherein the method comprises: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in step 1 to a low-dimensional embedding space through the encoder in the self-coding network; restoring it to a multivariate attribute model through the decoder in the self-coding network, and training the self-coding network through constrained reconstruction losses; and generating the multivariate attribute embedding space by recording the numerical range of the low-dimensional embedding space. The method comprises the following steps:
step 2-1, constructing a self-coding network; the method comprises the following steps:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m cell units with shared parameters; the τ-th cell unit receives the cell state C_(τ-1) and hidden state h_(τ-1) of the (τ-1)-th cell unit and the current input x_τ as inputs, and outputs the cell state C_τ and hidden state h_τ of the current cell. The update rules of the self-coding network are as follows:

f_τ = σ(W_f · <h_(τ-1), x_τ> + b_f)

p_τ = σ(W_p · <h_(τ-1), x_τ> + b_p)

C_τ = f_τ * C_(τ-1) + p_τ * tanh(W_C · <h_(τ-1), x_τ> + b_C)

o_τ = σ(W_o · <h_(τ-1), x_τ> + b_o)

h_τ = o_τ * tanh(C_τ)

where <·,·> denotes the splicing operation of two vectors, σ is the sigmoid function, tanh is the tanh function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) denote the weights and biases of four different fully-connected layers. The input of the encoder is the multivariate attribute model, and the output is one t-dimensional feature code f_attr.

The decoder has a structure consistent with that of the encoder; its input is the t-dimensional vector f_attr and its output is a multivariate attribute model. In the update rules, the cell state C_1 and hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the inputs x_τ of the decoder at all times are set to f_attr, where τ = 1, …, m.
Step 2-2, self-coding training; the method comprises the following steps:
training the self-coding network of step 2-1 with a stochastic gradient descent method, where the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis. The reconstruction loss is calculated as:

l_sigrec = Σ_i ||S_i − Ŝ_i||₂²

where S_i denotes the i-th vector of the multivariate attribute model and Ŝ_i denotes the i-th vector of the reconstructed multivariate attribute model.

The smoothing loss is calculated as:

l_smooth = ||∂²(S[m_g+m_a:, :]) / ∂x²||₂²

where the multivariate attribute matrix S is obtained by stacking all vectors Ŝ_i of the reconstructed multivariate attribute model in the column direction, [:] denotes a slicing operation, S[m_g+m_a:, :] selects the elements from the (m_g+m_a)-th row to the last row (the dynamic attributes), and ∂²/∂x² denotes the second derivative of the matrix in the x (time) direction.

The consistency loss l_consis is calculated analogously on the static rows S[:m_g+m_a, :], penalizing their variation across frames.

Finally, the total training loss of the entire network is:

l = λ₁ l_sigrec + λ₂ l_smooth + λ₃ l_consis

where λ₁, λ₂, λ₃ are balance factors.
Step 2-3, generating the multivariate attribute embedding space. The method comprises the following steps:

feature vectors are generated for all multivariate attribute models using the encoder of step 2-1, and the coding range of each dimension is calculated and stored in two vectors r_l and r_h, where each dimension of r_l records the minimum of the feature vectors in that dimension and each dimension of r_h records the maximum. The multivariate attribute embedding space consists of all vectors r satisfying {r | r_(l,α) ≤ r_α ≤ r_(h,α), α = 1, …, t}, where r_(l,α) denotes the value of the α-th dimension of r_l, r_α the value of the α-th dimension of the vector r, and r_(h,α) the value of the α-th dimension of r_h.
Step 3, multivariate attribute synthesis, wherein the method comprises: sampling from the multivariate attribute embedding space, and generating the multivariate attributes of each frame using the decoder trained in step 2; calculating the three-dimensional geometry and texture of the target object in each frame using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the multivariate attribute synthesis process to obtain a specified number of video samples, finally completing the generation of video samples based on multivariate attribute synthesis.
Step 3-1, sampling a feature vector from the multivariate attribute embedding space, and decoding it into a multivariate attribute model Ŝ with the decoder trained in step 2-2;

Step 3-2, generating dynamic attributes and static attributes: splitting the static attribute part and the dynamic attribute part out of the decoded multivariate attribute model, and further splitting them into geometry f̂^g, texture f̂^a, illumination f̂_t^l, pose f̂_t^p and deformation f̂_t^m; generating the representative static attributes, geometry f̂^g and texture f̂^a;

Step 3-3, synthesizing the geometric shape: the three-dimensional mesh model G_mesh^t of the t-th frame is calculated as follows:
G_mesh^t = G_avg + G_ne · f̂^g + G_mor · f̂_t^m

where G_avg is the average geometric shape defined in the three-dimensional deformable model 3DMM; its dimension is 3n, with n denoting the number of mesh vertices, and every 3 dimensions represent the x, y, z coordinate values of one three-dimensional vertex; G_ne is the neutral shape basis defined in 3DMM, with the same dimension as G_avg, where every 3 dimensions represent the offset of one three-dimensional vertex coordinate value; G_mor is the shape deformation basis defined in 3DMM, with the same dimension and meaning as G_ne; the resulting G_mesh^t is a vector of dimension 3n and is reshaped into an n × 3 matrix, in which each row represents the coordinate values of one three-dimensional mesh vertex of the model;
Step 3-4, synthesizing texture and illumination: the texture is synthesized first; the per-vertex color values are restored as follows:
T_mesh = T_avg + T_ne · f̂^a

where T_avg is the per-vertex texture of the average geometry defined by 3DMM; its dimension is 3n, and every 3 dimensions represent the R, G, B values of one three-dimensional vertex; T_ne is the texture basis defined in 3DMM, with the same dimension as T_avg, where every 3 dimensions represent the offsets of the R, G, B values of one three-dimensional vertex; the resulting T_mesh is a vector of dimension 3n and is reshaped into an n × 3 matrix, in which each row represents the R, G, B values of one three-dimensional mesh vertex of the model;
Synthesizing illumination: the illumination of the t-th frame is synthesized as follows:

ĉ_i^t = c_i · Σ_j f̂_t^l(j) · Y(j)
where c_i denotes the color value of the i-th vertex of the synthesized texture, f̂_t^l(j) denotes the value of the j-th dimension of the illumination of the t-th frame, and Y(j) denotes the j-th spherical harmonic basis; performing this operation for every vertex yields the per-vertex texture colors ĉ^t with synthesized illumination;
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometric shapeSynthesized illuminated vertex-by-vertex textureAnd attitudeSending into a renderer to draw a t frame image ItAnd mask image M of t-th frametRandomly cropping from another background picture and ItImages I of the same sizebAnd synthesizing the final video frame, wherein the synthesizing method comprises the following steps:
I^t = M_t ⊙ I_t + (1 − M_t) ⊙ I_b

where ⊙ denotes element-by-element multiplication;
Step 3-6, writing the generated T frame images into a video stream to obtain the final video sample;
Step 3-7, repeating steps 3-1 to 3-6 until the set number of samples is reached, completing the generation of video samples based on multivariate attribute synthesis.
Examples
As shown in fig. 1, the method for generating a video sample based on multivariate attribute synthesis disclosed in the present invention is specifically implemented according to the following steps:
Step 1, constructing a multivariate attribute model: the multivariate static attributes and dynamic attributes of the foreground object are decomposed frame by frame from each video using a pre-trained video encoder and an attribute decomposition network, processed for consistency and smoothness, and spliced into a multivariate attribute model.

Step 2, generating a multivariate attribute embedding space: a self-encoder is constructed based on a neural network; the multivariate attribute model constructed in step 1 is mapped to a low-dimensional embedding space through the encoder of the self-coding network and restored to a multivariate attribute model through the decoder, and the self-coding network is trained through constrained reconstruction losses. The multivariate attribute embedding space is generated by recording the value range of the low-dimensional embedding space.

Step 3, multivariate attribute synthesis: sampling from the multivariate attribute embedding space, generating the multivariate attributes of each frame with the trained decoder, calculating the three-dimensional geometry and texture of the target object in each frame with a shape synthesis method and an illumination synthesis method, and finally rendering to generate a video sample. Repeating the multivariate attribute synthesis process yields as many video samples as the user specifies.
The main flow of each step is described in detail below.
Wherein, step 1 includes the following steps:
Step 1.1, decomposing video attributes: the pre-trained video encoder encodes the video {I_i | i = 1, …, T} into frame-by-frame feature vectors {f_i | i = 1, …, T}, and the attribute decomposition network decomposes the frame-by-frame feature vectors into frame-by-frame geometry {f_i^g | i = 1, …, T}, texture {f_i^a | i = 1, …, T}, pose {f_i^p | i = 1, …, T}, illumination {f_i^l | i = 1, …, T} and deformation {f_i^m | i = 1, …, T}.

Step 1.2, attribute preprocessing: considering that T n-dimensional vectors {f_i | i = 1, …, T} expressing the same object should be kept as consistent as possible, static attribute consistency processing is applied to the frame-by-frame geometry and texture to obtain the consistency-processed geometry f̂^g and texture f̂^a; considering that T n-dimensional column vectors expressing the state of the same object at T consecutive times should change as smoothly as possible, the frame-by-frame pose, illumination and deformation are smoothed to obtain the smoothed pose f̂_i^p, illumination f̂_i^l and deformation f̂_i^m.

Step 1.3, multivariate attribute vector splicing: the multivariate attribute model S_i is obtained by applying the vector splicing operation to the geometry f̂^g, texture f̂^a, pose f̂_i^p, illumination f̂_i^l and deformation f̂_i^m obtained in step 1.2.

Step 1.4, repeating steps 1.1 to 1.3 for each video in the original video data set until every video in the data set has been processed.
The step 2 comprises the following steps:
Step 2.1, constructing a self-coding network: the encoder comprises m cell units with shared parameters; the i-th cell unit receives the cell state C_(i-1) and hidden state h_(i-1) of the (i-1)-th cell unit and the current input x_i as inputs, and outputs the cell state C_i and hidden state h_i of the current cell. The decoder has a structure consistent with the encoder, except that its input is the t-dimensional vector f_attr and its output is a multivariate attribute model; in the update rules, the cell state C_1 and hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the inputs x_i, i = 1, …, m, of the decoder are all set to f_attr.

Step 2.2, training the self-coding network: the self-coding network constructed in step 2.1 is trained by back propagation and stochastic gradient descent, minimizing the reconstruction loss, the smoothing loss and the consistency loss.

Step 2.3, generating the multivariate attribute embedding space: feature vectors are generated for all multivariate attribute models using the encoder of step 2.1, and the coding range of each dimension is calculated and stored in two vectors r_l and r_h, where each dimension of r_l records the minimum of the feature vectors in that dimension and each dimension of r_h records the maximum.
Step 3 comprises the following steps:

Step 3.1, sampling a feature vector from the multivariate attribute embedding space and decoding it into a multivariate attribute model Ŝ with the decoder trained in step 2.2.

Step 3.2, generating dynamic and static attributes: the static attribute part and the dynamic attribute part are split out of the decoded multivariate attribute model and further split into geometry f̂^g, texture f̂^a, illumination f̂_t^l, pose f̂_t^p and deformation f̂_t^m; next, the representative static attributes, geometry f̂^g and texture f̂^a, are generated.

Step 3.3, synthesizing the geometric shape: the geometry and deformation are synthesized into a three-dimensional shape according to the 3DMM model.

Step 3.4, synthesizing texture and illumination: the texture is synthesized into per-vertex color values according to the 3DMM model, and the per-vertex texture colors with synthesized illumination are obtained according to the spherical harmonic illumination formula.

Step 3.5, rendering: the synthesized t-th frame three-dimensional geometry G_mesh^t, the illuminated per-vertex texture ĉ^t and the pose f̂_t^p are sent to a renderer to draw the t-th frame rendered image I_t and the t-th frame mask image M_t; a background image I_b of the same size as I_t is randomly cropped from another background picture, and the rendered image and the background image are blended according to the mask image.

Step 3.6, writing the generated T frame images into a video stream to obtain the final video sample.

Step 3.7, repeating steps 3.1 to 3.6 until the number of samples required by the user is met.
In this embodiment, a human face is taken as an example for explanation. The specific implementation process is as follows:
In step 1, the video data set mainly adopts the data set from Nagrani A, Chung J S, Zisserman A. VoxCeleb: a large-scale speaker identification dataset [J]. arXiv preprint arXiv:1706.08612, 2017. Step 1.1 is then performed for attribute decomposition. The residual network in step 1.1 is constructed in the manner described in He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778. The video coding network and attribute decomposition network were pre-trained on the CelebA dataset in the manner described in Deng Y, Yang J, Xu S, et al. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019. The multivariate attribute model is then constructed according to steps 1.2 and 1.3.
In step 2, the self-coding network is trained on the multivariate attribute models obtained in step 1, and the multivariate attribute embedding space is generated. The stochastic gradient descent method used in step 2.2 follows Bottou L. Stochastic gradient descent tricks [M]//Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012: 421-436.
To illustrate some intermediate results of step 3, single-frame generation is shown first. The multivariate attribute embedding space generated in step 2 is sampled; representative static attributes are generated according to steps 3.1 and 3.2 with the deformation and illumination set to all-zero vectors; the face three-dimensional model and per-vertex color values are obtained with steps 3.3 and 3.4; and the results of multiple samplings are rendered with the rendering step of 3.5 to obtain the results shown in Fig. 2. Illumination is then generated according to steps 3.1 and 3.2, the per-vertex color values are re-synthesized according to step 3.4, and rendering with step 3.5 gives the results shown in Fig. 3. Fig. 4 is the corresponding mask image, in which the white area represents the face region and the black area represents the background region.
Next, multi-frame video generation is shown. First, the multivariate attribute embedding space generated in step 2 is sampled, and representative static attributes and t dynamic attributes are generated according to steps 3.1 and 3.2; the rendered image of each frame is then obtained according to steps 3.3, 3.4 and 3.5 using the static attributes and the dynamic attributes at each moment. Because many video frames are rendered, one picture is taken every 3 frames and spliced to obtain the image shown in Fig. 5. The embedding space is then sampled again, representative static attributes and t dynamic attributes are generated according to steps 3.1 and 3.2, the second set of representative attributes is replaced by the representative attributes obtained from the first sampling, and the rendered image of each frame is obtained according to steps 3.3, 3.4 and 3.5; one picture is taken every 3 frames and spliced to obtain the image shown in Fig. 6. Fig. 6 shows a new video segment with frames at a sampling interval of 3; the identity of the person in Fig. 6 is consistent with that in Fig. 5, the differences being the illumination, deformation and pose.
The 3DMM model in steps 3.3 and 3.4 is the model described in Paysan P, Knothe R, Amberg B, et al. A 3D face model for pose and illumination invariant face recognition [C]//2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2009: 296-301.
The present invention provides a video sample generation method based on multivariate attribute synthesis; there are many methods and approaches for implementing this technical solution, and the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be realized by the prior art.
Claims (10)
1. A video sample generation method based on multivariate attribute synthesis is characterized by comprising the following steps:
step 1, constructing a multi-attribute model, wherein the method comprises the following steps: giving an initial video data set, preprocessing each video in the data set, and decomposing each video into a plurality of static attributes and dynamic attributes of a foreground object frame by frame according to a pre-trained video encoder and an attribute decomposition network; respectively carrying out consistency and smoothness processing on the multivariate static attribute and the dynamic attribute according to consistency constraint and smoothness constraint; generating a multivariate attribute model through vector splicing;
step 2, generating a multivariate attribute embedding space, wherein the method comprises: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in step 1 to a low-dimensional embedding space through the encoder in the self-coding network; restoring it to a multivariate attribute model through the decoder in the self-coding network, and training the self-coding network through constrained reconstruction losses; and generating the multivariate attribute embedding space by recording the numerical range of the low-dimensional embedding space;
step 3, multivariate attribute synthesis, wherein the method comprises: sampling from the multivariate attribute embedding space, and generating the multivariate attributes of each frame using the decoder trained in step 2; calculating the three-dimensional geometry and texture of the target object in each frame using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the multivariate attribute synthesis process to obtain a specified number of video samples, finally completing the generation of video samples based on multivariate attribute synthesis.
2. The method for generating a video sample based on multivariate attribute synthesis as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1-1, video preprocessing, the method comprising:
utilizing a pre-trained target detector to perform frame-by-frame target detection on the video, cropping out the targets in the video according to the detected bounding boxes, and finally scaling each cropped target to a size of 224 × 224 to obtain the processed video frame sequence I_i, i = 1, …, T, where i denotes the i-th frame and T denotes the total number of frames in the video;
step 1-2, decomposing video attributes;
step 1-3, preprocessing attributes;
step 1-4, multivariate attribute vector splicing;

step 1-5, repeating steps 1-1 to 1-3 for each video in the given initial video data set until every video in the data set has been processed, completing the construction of the multivariate attribute model.
3. The method according to claim 2, wherein the video attribute decomposition method in step 1-2 comprises:
utilizing a pre-trained video encoder to encode the preprocessed video sequence I_i obtained in step 1-1 into frame-by-frame feature vectors f_i; the video encoder comprises T parameter-sharing copies of a residual network, whose input is a 224 × 224 three-channel image and whose output is an n-dimensional feature vector;

utilizing a pre-trained attribute decomposition network to decompose the video into frame-by-frame geometry f_i^g, texture f_i^a, pose f_i^p, illumination f_i^l and deformation f_i^m; the attribute decomposition network comprises 5 sub-networks: a geometry estimation network, a texture estimation network, a pose estimation network, an illumination estimation network and a deformation estimation network. The geometry estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for geometry estimation; the texture estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_a-dimensional feature vector for texture estimation; the illumination estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 27-dimensional feature vector for illumination estimation; the pose estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 6-dimensional feature vector for pose estimation; the deformation estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for deformation estimation.
4. The method for generating a video sample based on multivariate attribute synthesis as claimed in claim 3, wherein the attribute preprocessing method in step 1-3 comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: solving the consistency vector f_con with the following objective function:

f_con = argmin_f Σ_{i=1..T} ||f_i − f||²
The frame-by-frame geometry f_i^g and texture f_i^a obtained in step 1-2 are each input into the static attribute consistency processing method to obtain the consistency-processed geometry f̃^g and texture f̃^a;
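Assuming the least-squares objective written above (the patent's original formula is not reproduced in this text), its minimizer is simply the frame-wise mean, as in this sketch:

```python
import numpy as np

def consistency_vector(F):
    """Solve f_con = argmin_f sum_i ||f_i - f||^2 for frame vectors F, shape (T, d).

    Under the assumed least-squares objective the minimizer is the mean.
    """
    return F.mean(axis=0)
```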
The dynamic attribute smoothness processing method comprises the following steps: splicing the T frame-by-frame column vectors into an n × T matrix in the column direction in temporal order; splitting the matrix by rows to obtain n T-dimensional vectors f′_j, where j = 1, ..., n; each vector is processed as follows:
f″_j = kernel * f′_j

wherein * denotes the discrete convolution operation and kernel = [0.0545, 0.244, 0.403, 0.244, 0.0545] is a one-dimensional convolution kernel; the vector f′_j is convolved with kernel at stride 1 to obtain the smoothed result f″_j; after all n vectors have been smoothed in this way, they are spliced back into an n × T matrix in row order, and the matrix is split by columns to obtain T smoothed n-dimensional column vectors; the frame-by-frame pose f_i^p, illumination f_i^l and deformation f_i^m obtained in step 1-2 are input into the dynamic attribute smoothness processing method to obtain the smoothed pose f̃_i^p, illumination f̃_i^l and deformation f̃_i^m.
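A small NumPy sketch of this row-wise smoothing; edge padding is used to keep the output length T, which is an assumption, since the claim fixes only the kernel and the stride:

```python
import numpy as np

KERNEL = np.array([0.0545, 0.244, 0.403, 0.244, 0.0545])

def smooth_dynamic(X):
    """Smooth an (n, T) attribute matrix row by row with the fixed kernel."""
    pad = len(KERNEL) // 2
    out = np.empty_like(X)
    for j in range(X.shape[0]):
        row = np.pad(X[j], pad, mode="edge")             # boundary rule: assumption
        out[j] = np.convolve(row, KERNEL, mode="valid")  # stride-1 convolution
    return out
```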
5. The method according to claim 4, wherein the multivariate attribute vector splicing method in step 1-4 comprises: performing a vector splicing operation on the geometry f̃^g, texture f̃^a, pose f̃_i^p, illumination f̃_i^l and deformation f̃_i^m obtained in step 1-3 to construct the multivariate attribute model:

S_i = <f̃^g, f̃^a, f̃_i^p, f̃_i^l, f̃_i^m>
wherein <·, ·, ..., ·> denotes the splicing operation of a plurality of vectors, and S_i denotes the result of splicing the multivariate attribute vectors of the i-th frame.
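The splice itself is a plain concatenation; note that the static parts (geometry, texture) are placed before the dynamic parts, consistent with the row slicing [m_g + m_a :, :] used in claim 8. A sketch:

```python
import numpy as np

def splice_attributes(g, a, p, l, m):
    """Build S_i = <g, a, p, l, m>: static attributes first, then dynamic."""
    return np.concatenate([g, a, p, l, m])
```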
6. The method according to claim 5, wherein step 2 comprises:
step 2-1, constructing a self-coding network;
step 2-2, self-coding training;
and step 2-3, generating the multivariate attribute embedding space.
7. The method according to claim 6, wherein the method for constructing the self-coding network in step 2-1 comprises:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m parameter-sharing cell units; the τ-th cell unit receives the cell state C_{τ−1} and hidden state h_{τ−1} of the (τ−1)-th cell unit and the current input x_τ as inputs, and outputs the current cell state C_τ and hidden state h_τ; the update rules of the self-coding network are as follows:
f_τ = σ(W_f · <h_{τ−1}, x_τ> + b_f)

p_τ = σ(W_p · <h_{τ−1}, x_τ> + b_p)

C̃_τ = tanh(W_C · <h_{τ−1}, x_τ> + b_C)

C_τ = f_τ * C_{τ−1} + p_τ * C̃_τ

o_τ = σ(W_o · <h_{τ−1}, x_τ> + b_o)

h_τ = o_τ * tanh(C_τ)
wherein <·,·> denotes the splicing operation of two vectors, σ is the sigmoid function, tanh is the hyperbolic tangent function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) denote the weights and biases of four different fully-connected layers; the input of the encoder is the multivariate attribute model, and the output is one t-dimensional feature code f_attr;
The decoder has the same structure as the encoder; the input of the decoder is the t-dimensional vector f_attr, and the output is the multivariate attribute model; in the update rules, the cell state C_1 and hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the input x_τ of the decoder is set to f_attr at every time step, where τ = 1, ..., m.
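For illustration, a PyTorch sketch of such a sequence autoencoder; `nn.LSTM` implements the gate equations above, and the output projection layer is an assumption, added so the decoder's hidden states map back to attribute vectors:

```python
import torch.nn as nn

class SequenceAutoencoder(nn.Module):
    """LSTM autoencoder over per-frame attribute vectors (a sketch).

    d is the dimension of each spliced vector S_i and t is the dimension
    of the feature code f_attr, following the symbols in the claims.
    """

    def __init__(self, d, t):
        super().__init__()
        self.encoder = nn.LSTM(d, t, batch_first=True)
        self.decoder = nn.LSTM(t, t, batch_first=True)
        self.project = nn.Linear(t, d)   # assumption: map states back to S_i

    def forward(self, S):                # S: (batch, m, d)
        _, (h, _) = self.encoder(S)
        f_attr = h[-1]                   # (batch, t) feature code
        # feed f_attr at every decoder time step, as the claim specifies
        x = f_attr.unsqueeze(1).expand(-1, S.shape[1], -1)
        out, _ = self.decoder(x)
        return self.project(out), f_attr
```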
8. The method according to claim 7, wherein the self-coding training method in step 2-2 comprises:
training the self-coding network of step 2-1 by a stochastic gradient descent method, wherein the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis; the reconstruction loss is calculated as follows:

l_sigrec = Σ_{i=1..T} ||S_i − Ŝ_i||²
wherein S_i denotes the i-th vector of the multivariate attribute model, and Ŝ_i denotes the i-th vector of the reconstructed multivariate attribute model;
the smoothing loss is calculated as follows:

l_smooth = ||∂²_x S[m_g + m_a :, :]||²

wherein the multivariate attribute matrix S is obtained by stacking all vectors Ŝ_i of the reconstructed multivariate attribute model in the column direction; [:, :] denotes a slicing operation, [m_g + m_a :, :] selects the elements from the (m_g + m_a)-th row to the last row (the dynamic attribute rows), and ∂²_x denotes the second derivative of the matrix in the x (temporal) direction;
the consistency loss is calculated analogously on the static attribute rows:

l_consis = ||∂_x S[: m_g + m_a, :]||²

penalizing frame-to-frame variation of the static attributes; finally, the total training loss of the entire network is as follows:
l = λ_1 · l_sigrec + λ_2 · l_smooth + λ_3 · l_consis
wherein λ_1, λ_2, λ_3 are balance factors.
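A sketch of the three-part loss on a (d, T) attribute matrix; the exact norms and the finite-difference forms of ∂_x and ∂²_x are assumptions reconstructed from the claims:

```python
import torch

def total_loss(S, S_hat, m_g, m_a, lambdas=(1.0, 1.0, 1.0)):
    """l = λ1·l_sigrec + λ2·l_smooth + λ3·l_consis for (d, T) matrices."""
    l1, l2, l3 = lambdas
    l_sigrec = ((S - S_hat) ** 2).sum()              # reconstruction
    dyn = S_hat[m_g + m_a:, :]                       # dynamic attribute rows
    l_smooth = ((dyn[:, 2:] - 2 * dyn[:, 1:-1] + dyn[:, :-2]) ** 2).sum()
    sta = S_hat[:m_g + m_a, :]                       # static attribute rows
    l_consis = ((sta[:, 1:] - sta[:, :-1]) ** 2).sum()
    return l1 * l_sigrec + l2 * l_smooth + l3 * l_consis
```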
9. The method according to claim 8, wherein the method for generating the multivariate attribute embedding space in step 2-3 comprises:
generating feature vectors for all multivariate attribute models using the encoder of step 2-1; calculating the coding range of each dimension and storing it in two vectors r_l and r_h, wherein each dimension of r_l records the minimum value of the feature vectors in that dimension and each dimension of r_h records the maximum value of the feature vectors in that dimension; the multivariate attribute embedding space consists of all vectors r satisfying: {r | r_{l,α} ≤ r_α ≤ r_{h,α}, α = 1, ..., t}, where r_{l,α} denotes the value of the α-th dimension of r_l, r_α denotes the value of the α-th dimension of the vector r, and r_{h,α} denotes the value of the α-th dimension of r_h.
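A sketch of building the embedding space and drawing from it; uniform sampling within the box is an assumption, since the claim only defines the box itself:

```python
import numpy as np

def embedding_space(codes):
    """Per-dimension range of encoder outputs `codes`, shape (N, t)."""
    return codes.min(axis=0), codes.max(axis=0)      # r_l, r_h

def sample_code(r_l, r_h, rng=None):
    """Draw one vector r with r_l <= r <= r_h (uniformly: an assumption)."""
    rng = rng or np.random.default_rng()
    return rng.uniform(r_l, r_h)
```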
10. The method according to claim 9, wherein step 3 comprises:
step 3-1, sampling a feature vector from the multivariate attribute embedding space, and decoding it into a multivariate attribute model with the decoder trained in step 2-2;
Step 3-2, generating the dynamic and static attributes: splitting the decoded multivariate attribute model into a static attribute part and a dynamic attribute part, and further splitting these into geometry, texture, illumination, pose and deformation components, which yield the representative static attributes (the geometry f̃^g and texture f̃^a) and the per-frame dynamic attributes;
Step 3-3, synthesizing a geometric shape: t frame three-dimensional grid modelThe calculation method of (2) is as follows:
wherein G_avg is the average geometry defined in the three-dimensional deformable model (3DMM); its dimension is 3n, where n denotes the number of mesh vertices and every 3 dimensions represent the x, y, z coordinates of one three-dimensional vertex; G_ne is the neutral shape basis defined in 3DMM, with the same dimension as G_avg, every 3 dimensions representing the offset of one three-dimensional vertex coordinate; G_mor is the shape deformation basis defined in 3DMM, with the same dimension and meaning as G_ne; the resulting G_mesh^t is a vector of dimension 3n, which is reshaped into an n × 3 matrix, each row representing the coordinates of one vertex of the model's three-dimensional mesh;
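A sketch of this linear 3DMM shape synthesis; storing the bases as (3n, k) column matrices is a layout assumption:

```python
import numpy as np

def synthesize_geometry(G_avg, G_ne, G_mor, g, m_t):
    """G_mesh^t = G_avg + G_ne·g + G_mor·m_t, reshaped to one vertex per row.

    G_avg: (3n,) mean geometry; G_ne, G_mor: (3n, k) bases (assumed layout);
    g, m_t: static geometry and per-frame deformation coefficients.
    """
    G_mesh = G_avg + G_ne @ g + G_mor @ m_t
    return G_mesh.reshape(-1, 3)   # n rows of (x, y, z)
```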
step 3-4, synthesizing the texture and illumination: first, the texture is synthesized; the per-vertex color values are recovered as follows:

T_mesh = T_avg + T_ne · f̃^a
wherein T_avg is the per-vertex texture of the average geometry defined in 3DMM, with dimension 3n, every 3 dimensions representing the R, G, B values of one three-dimensional vertex; T_ne is the texture basis defined in 3DMM, with the same dimension as T_avg, every 3 dimensions representing the offset of the R, G, B values of one three-dimensional vertex; the resulting T_mesh is a vector of dimension 3n, which is reshaped into an n × 3 matrix, each row representing the R, G, B values of one vertex of the model's three-dimensional mesh;
and synthesizing the illumination: the t-th frame illumination is synthesized as follows:

ĉ_i^t = c̃_i ⊙ Σ_j f̃_{t,j}^l · Y(j)

wherein c̃_i denotes the color value of the i-th vertex of the synthesized texture, f̃_{t,j}^l denotes the value of the j-th dimension of the t-th frame illumination, and Y(j) denotes the j-th spherical harmonic base; applying this operation to every vertex yields the colors ĉ^t of the illuminated per-vertex texture;
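A sketch of the texture synthesis and spherical-harmonics shading; reading the 27-dim illumination as 9 SH coefficients per color channel, and the `sh_basis` function evaluating Y(j) from vertex normals, are both assumptions:

```python
import numpy as np

def synthesize_texture(T_avg, T_ne, a):
    """T_mesh = T_avg + T_ne·a, reshaped to per-vertex (R, G, B) rows."""
    return (T_avg + T_ne @ a).reshape(-1, 3)

def apply_sh_illumination(colors, normals, l, sh_basis):
    """Shade per-vertex colors with 27-dim spherical-harmonics lighting.

    colors, normals: (n, 3); l: 27-dim illumination attribute, read as
    9 coefficients per channel (assumption); sh_basis: maps normals to
    the (n, 9) basis values Y(j).
    """
    Y = sh_basis(normals)            # (n, 9)
    shade = Y @ l.reshape(9, 3)      # (n, 3) per-channel irradiance
    return colors * shade            # element-wise modulation
```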
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometry G_mesh^t, the illuminated per-vertex texture ĉ^t and the pose f̃_t^p are fed into a renderer to draw the t-th frame image I_t and the t-th frame mask image M_t; an image I_b of the same size as I_t is randomly cropped from another background picture, and the final video frame is synthesized as follows:

I_t^final = M_t ⊙ I_t + (1 − M_t) ⊙ I_b

wherein ⊙ denotes element-by-element multiplication;
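The final composition is a standard mask blend, as in this sketch:

```python
import numpy as np

def composite_frame(I_t, M_t, I_b):
    """I_final = M_t * I_t + (1 - M_t) * I_b for float images in [0, 1].

    I_t, I_b: (H, W, 3); M_t: (H, W, 1) mask.
    """
    return M_t * I_t + (1.0 - M_t) * I_b
```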
step 3-6, writing the generated T frame images into a video stream to obtain the final video sample;
and step 3-7, repeating steps 3-1 to 3-6 until the set number of video samples is reached, completing the generation of video samples based on multivariate attribute synthesis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210423708.2A CN114694081A (en) | 2022-04-21 | 2022-04-21 | Video sample generation method based on multivariate attribute synthesis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114694081A true CN114694081A (en) | 2022-07-01 |
Family
ID=82144208
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210423708.2A Pending CN114694081A (en) | 2022-04-21 | 2022-04-21 | Video sample generation method based on multivariate attribute synthesis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114694081A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115499396A (en) * | 2022-11-16 | 2022-12-20 | 北京红棉小冰科技有限公司 | Information generation method and device with personality characteristics |
CN115499396B (en) * | 2022-11-16 | 2023-04-07 | 北京红棉小冰科技有限公司 | Information generation method and device with personality characteristics |
CN116843862A (en) * | 2023-08-29 | 2023-10-03 | 武汉必盈生物科技有限公司 | Three-dimensional thin-wall model grid surface texture synthesis method |
CN116843862B (en) * | 2023-08-29 | 2023-11-24 | 武汉必盈生物科技有限公司 | Three-dimensional thin-wall model grid surface texture synthesis method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lee et al. | Context-aware synthesis and placement of object instances | |
CN109389671B (en) | Single-image three-dimensional reconstruction method based on multi-stage neural network | |
CN110390638B (en) | High-resolution three-dimensional voxel model reconstruction method | |
CN111047548B (en) | Attitude transformation data processing method and device, computer equipment and storage medium | |
CN111368662B (en) | Method, device, storage medium and equipment for editing attribute of face image | |
CN114694081A (en) | Video sample generation method based on multivariate attribute synthesis | |
CN116958453B (en) | Three-dimensional model reconstruction method, device and medium based on nerve radiation field | |
Huang et al. | Ponder: Point cloud pre-training via neural rendering | |
Yun et al. | Joint face super-resolution and deblurring using generative adversarial network | |
Chen et al. | Domain adaptation for underwater image enhancement via content and style separation | |
CN111462274A (en) | Human body image synthesis method and system based on SMPL model | |
CN112634438A (en) | Single-frame depth image three-dimensional model reconstruction method and device based on countermeasure network | |
Lei et al. | NITES: A non-parametric interpretable texture synthesis method | |
RU2713695C1 (en) | Textured neural avatars | |
Kouzani et al. | Towards invariant face recognition | |
Zhang et al. | DIMNet: Dense implicit function network for 3D human body reconstruction | |
CN110322548B (en) | Three-dimensional grid model generation method based on geometric image parameterization | |
CN117173445A (en) | Hypergraph convolution network and contrast learning multi-view three-dimensional object classification method | |
CN111311732A (en) | 3D human body grid obtaining method and device | |
CN116452715A (en) | Dynamic human hand rendering method, device and storage medium | |
CN113129347B (en) | Self-supervision single-view three-dimensional hairline model reconstruction method and system | |
Laradji et al. | SSR: Semi-supervised Soft Rasterizer for single-view 2D to 3D Reconstruction | |
De Souza et al. | Fundamentals and challenges of generative adversarial networks for image-based applications | |
Shangguan et al. | 3D human pose dataset augmentation using generative adversarial network | |
Johnston et al. | Single View 3D Point Cloud Reconstruction using Novel View Synthesis and Self-Supervised Depth Estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||