CN114694081A - Video sample generation method based on multivariate attribute synthesis - Google Patents


Info

Publication number: CN114694081A
Application number: CN202210423708.2A
Authority: CN (China)
Prior art keywords: attribute, video, frame, dimensional, network
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 孙正兴, 骆守桐, 孙蕴瀚, 徐烨超
Current Assignee: Nanjing University
Original Assignee: Nanjing University
Application filed by Nanjing University
Priority to CN202210423708.2A
Publication of CN114694081A

Classifications

    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (under G06F 17/10 Complex mathematical operations; G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions; G06F Electric digital data processing; G06 Computing; Calculating or counting; G Physics)
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation; G06F 18/20 Analysing; G06F 18/00 Pattern recognition)
    • G06N 3/045 — Combinations of networks (under G06N 3/04 Architecture, e.g. interconnection topology; G06N 3/02 Neural networks; G06N 3/00 Computing arrangements based on biological models; G06N Computing arrangements based on specific computational models)
    • G06N 3/08 — Learning methods (under G06N 3/02 Neural networks; G06N 3/00 Computing arrangements based on biological models)


Abstract

The invention discloses a video sample generation method based on multivariate attribute synthesis, which comprises the following steps. Constructing a multivariate attribute model: first, the multivariate static attributes and dynamic attributes of the foreground object are decomposed frame by frame from a video with a pre-trained video encoder and an attribute decomposition network; the static and dynamic attributes are then subjected to consistency and smoothness processing according to consistency and smoothness constraints; finally, the multivariate attribute model is generated through vector splicing. Generating a multivariate attribute embedding space: an auto-encoder is constructed on the basis of a neural network, trained, and used to generate the multivariate attribute embedding space. Multivariate attribute synthesis: a sample is drawn from the multivariate attribute embedding space, the multivariate attributes of each frame are generated with the trained decoder, the three-dimensional geometry and texture of the target object in each frame are computed with a shape synthesis method and an illumination synthesis method, and a video sample is finally rendered. Repeating the multivariate attribute synthesis process yields as many video samples as the user specifies.

Description

Video sample generation method based on multivariate attribute synthesis
Technical Field
The invention relates to a video sample generation method, and in particular to a video sample generation method based on multivariate attribute synthesis.
Background
With the rapid development of the information age, video data is growing explosively, and demand for video-based downstream applications such as video object detection and video prediction keeps increasing. Thanks to the development of artificial intelligence, AI technologies represented by deep learning have achieved great success in a variety of applications, a success that cannot be separated from the support of large numbers of training samples. However, despite the large amount of video data on the Internet, collecting a wide variety of video samples remains a difficult task.
Although methods for synthesizing large amounts of video data have emerged in recent years, for example Yang C, Wang Z, Zhu X, et al. Pose guided human video generation [C]// Proceedings of the European Conference on Computer Vision (ECCV). 2018: 201-216, and Dorkenwald M, Milbich T, Blattmann A, et al. Stochastic image-to-video synthesis using cINNs [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3742-3753, these methods aim at improving the quality of the synthesized video and cannot meet the diversity requirements of video samples.
The data diversity required of video samples is mainly reflected in the variety of geometry, texture, illumination, pose and deformation of the target in the video. Most of the existing literature focuses on the diversity problem in image synthesis. Document 1 (Z. Yu and C. Zhang, "Image based static facial expression recognition with multiple deep network learning," in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, 2015, pp. 435-442) addresses direction and scale diversity through traditional augmentation such as cropping and flipping. Document 2 (Zhou H, Liu J, Liu Z, et al. Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 5911-5920) first fits a three-dimensional model to a single image, then rotates and renders the three-dimensional model to obtain an initially rotated face image, and finally obtains a photorealistic rotated face image through a GAN framework, thereby addressing pose diversity. Document 3 (I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680) addresses deformation diversity by means of GANs. Document 4 (M. Shin, M. Kim, and D.-S. Kwon, "Baseline CNN structure analysis for facial expression recognition," in Robot and Human Interactive Communication (RO-MAN), 2016 25th IEEE International Symposium on. IEEE, 2016, pp. 724-729) decomposes a face image into high-frequency and low-frequency components by Fourier transform and reconstructs a new face image by inverse Fourier transform; however, this method only improves image contrast and cannot solve the illumination diversity problem. Document 5 (Feng Li. Face illumination regularization method based on the SFS algorithm [D]. 2005) uses the SFS algorithm to decompose a face image into albedo, normal and illumination components, replaces the illumination component with a frontal light source, and then reconstructs the face image to achieve illumination regularization. Document 6 (Wang C, Wang C, Xu C, et al. Tag disentangled generative adversarial networks for object image re-rendering [C]. 2017) decomposes a face image into a series of labels such as identity, pose and illumination, and manipulates the illumination label to generate a new face image; this method has some ability to generate pictures with diverse lighting. The above methods are mainly directed to the expansion of image samples. In practice, video applications such as video surveillance and live webcasting are often more demanding than single-image applications, so an expansion scheme for video data is needed. Existing video data expansion schemes, such as Patent 1 (Chen Li, Gao et al., Chinese patent 10743444.7, 2019-12-24), simply fuse existing data sets and make no special design for diversity. On the basis of the above methods, the invention provides an expansion method for video data that meets the diversity requirement.
Disclosure of Invention
Purpose of the invention: the invention aims to overcome the defects of the prior art by providing a video sample generation method based on multivariate attribute synthesis.
In order to solve the technical problem, the invention discloses a video sample generation method based on multivariate attribute synthesis, which comprises the following steps:
step 1, constructing a multi-attribute model, wherein the method comprises the following steps: giving an initial video data set, preprocessing each video in the data set, and decomposing each video into a plurality of static attributes and dynamic attributes of a foreground object frame by frame according to a pre-trained video encoder and an attribute decomposition network; respectively carrying out consistency and smoothness processing on the multivariate static attribute and the dynamic attribute according to consistency constraint and smoothness constraint; generating a multivariate attribute model through vector splicing;
step 2, generating a multi-attribute embedding space, wherein the method comprises the following steps: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in the step 1 to a low-dimensional embedding space through an encoder in a self-coding network; a decoder in the self-coding network is restored into a multi-attribute model, and the self-coding network is trained through constrained reconstruction loss; generating a multi-attribute embedding space by recording the numerical value range of the low-dimensional embedding space;
step 3, synthesizing the multi-element attribute, wherein the method comprises the following steps: sampling from the multi-element attribute embedded space, and generating the multi-element attribute of each frame by using the decoder trained in the step 2; calculating the three-dimensional geometry and the texture of each frame of target object by using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the process of synthesizing the multi-attribute to obtain a plurality of appointed video samples, and finally finishing the generation of the video samples synthesized based on the multi-attribute.
In the invention, the step 1 comprises the following steps:
step 1-1, video preprocessing;
step 1-2, decomposing video attributes;
step 1-3, preprocessing attributes;
1-4, splicing multiple attribute vectors;
and 1-5, repeating the steps 1-1 to 1-3 for each video in the given initial video data set until each video in the video data set is processed, and completing the construction of the multivariate attribute model.
In the invention, the video preprocessing method in step 1-1 comprises the following steps:
utilizing a pre-trained target detector to carry out frame-by-frame target detection on the video, cutting out the target in each frame according to the detected bounding box, and finally scaling each cropped target to a size of 224 × 224 to obtain the processed video frame sequence I_i, i = 1, …, T, where i denotes the i-th frame and T denotes the total number of frames in the video.
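As a rough illustration of this preprocessing step, the following Python sketch assumes a generic detector callable detect_largest_box and the use of OpenCV for cropping and resizing; neither is specified by the patent.

```python
import cv2

def preprocess_video(frames, detect_largest_box):
    """Crop the detected foreground target in every frame and resize it to 224x224.

    frames: list of HxWx3 uint8 images (one per video frame).
    detect_largest_box: callable frame -> (x1, y1, x2, y2); a stand-in for the
        pre-trained target detector of step 1-1.
    Returns the processed frame sequence I_1, ..., I_T.
    """
    processed = []
    for frame in frames:
        x1, y1, x2, y2 = detect_largest_box(frame)
        crop = frame[y1:y2, x1:x2]
        processed.append(cv2.resize(crop, (224, 224)))
    return processed
```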
The video attribute decomposition method in step 1-2 of the invention comprises the following steps:
utilizing the pre-trained video encoder to encode the preprocessed video sequence I_i obtained in step 1-1 into frame-by-frame feature vectors f_i; the video encoder comprises a residual network with T shared-parameter copies, whose input is a 224 × 224 three-channel image and whose output is an n-dimensional feature vector;
decomposing the video into frame-by-frame geometry f_i^g, texture f_i^a, pose f_i^p, illumination f_i^l and deformation f_i^m using a pre-trained attribute decomposition network; the attribute decomposition network comprises 5 sub-networks: a geometry estimation network, a texture estimation network, a pose estimation network, an illumination estimation network and a deformation estimation network; the geometry estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for geometry estimation; the texture estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_a-dimensional feature vector for texture estimation; the illumination estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 27-dimensional feature vector for illumination estimation; the pose estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 6-dimensional feature vector for pose estimation; the deformation estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for deformation estimation;
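A minimal PyTorch sketch of the encoder and the five estimation heads described above; the backbone choice (resnet18) and the default values of n, m_g and m_a are illustrative assumptions, since the patent does not fix them.

```python
import torch.nn as nn
import torchvision.models as models

class AttributeDecomposition(nn.Module):
    """Frame-wise video encoder plus five single-layer estimation heads (step 1-2)."""
    def __init__(self, n=512, m_g=80, m_a=80):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, n)  # n-dimensional frame feature
        self.encoder = backbone
        self.geometry = nn.Linear(n, m_g)       # f_i^g
        self.texture = nn.Linear(n, m_a)        # f_i^a
        self.pose = nn.Linear(n, 6)             # f_i^p
        self.illumination = nn.Linear(n, 27)    # f_i^l
        self.deformation = nn.Linear(n, m_g)    # f_i^m

    def forward(self, frames):                  # frames: (T, 3, 224, 224)
        f = self.encoder(frames)                # (T, n) shared-parameter encoding
        return (self.geometry(f), self.texture(f), self.pose(f),
                self.illumination(f), self.deformation(f))
```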
the attribute preprocessing method in the steps 1-3 of the invention comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: solving the consistency vector f by using the following objective functioncon
Figure BDA0003607550030000041
The frame-by-frame geometry f obtained in step 1-1i hAnd texture fi aRespectively input into a static attribute consistency processing method to obtain the geometry after consistency processing
Figure BDA0003607550030000042
And texture
Figure BDA0003607550030000043
The dynamic attribute smoothness processing method comprises the following steps: splicing the T column vectors into an n multiplied by T matrix in the column direction according to the time sequence; splitting the matrix by rows to obtain n vectors f 'of T dimensions'jWhere j is 1, …, n, the following is performed for each vectorProcessing:
f″j=kernel*f′j
where denotes a discrete convolution operation, kernel ═ 0.0545,0.244,0.403,0.244,0.0545]Denotes a one-dimensional convolution kernel, vector f'jPerforming convolution operation by using kernel in a mode of step length being 1 to obtain a smoothed result f ″j(ii) a Smoothing all n-dimensional vectors in the above manner, splicing the vectors back into an n multiplied by T matrix according to the sequence of rows, and splitting the matrix according to columns to obtain T smoothed n-dimensional column vectors; the frame-by-frame attitude f obtained in the step 1-1i pIllumination fi lAnd deformation fi mInputting a dynamic attribute smoothness processing method to obtain a smoothed attitude
Figure BDA0003607550030000044
Illumination of light
Figure BDA0003607550030000045
And deformation of
Figure BDA0003607550030000046
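A sketch of the two preprocessing operations in NumPy. The smoothing uses the stated kernel at stride 1; the boundary handling (mode="same") and the computation of f_con as the per-dimension mean over frames, i.e. the least-squares minimizer of the objective above, are assumptions about details not reproduced in the source.

```python
import numpy as np

KERNEL = np.array([0.0545, 0.244, 0.403, 0.244, 0.0545])

def consistency(vectors):
    """Static-attribute consistency: collapse T n-dimensional vectors into one f_con.
    Implemented as the per-dimension mean (assumed least-squares solution)."""
    return np.mean(np.stack(vectors, axis=0), axis=0)

def smooth(vectors):
    """Dynamic-attribute smoothing: convolve every dimension over time with KERNEL."""
    mat = np.stack(vectors, axis=1)                      # n x T matrix, columns in time order
    rows = [np.convolve(row, KERNEL, mode="same") for row in mat]
    smoothed = np.stack(rows, axis=0)                    # back to n x T
    return [smoothed[:, t] for t in range(smoothed.shape[1])]  # T smoothed column vectors
```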
The multivariate attribute vector splicing method in step 1-4 of the invention is as follows: the geometry ĝ, texture â, pose p̂_i, illumination l̂_i and deformation m̂_i obtained in step 1-3 are spliced to construct the multivariate attribute model:

S_i = <ĝ, â, p̂_i, l̂_i, m̂_i>

wherein <·, ·, …, ·> denotes the splicing operation of several vectors and S_i denotes the spliced multivariate attribute vector of the i-th frame.
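The splicing itself is a plain concatenation; stacking the per-frame vectors column-wise also yields the attribute matrix used later by the training losses. A short sketch, with the attribute order assumed to follow the formula above:

```python
import numpy as np

def splice_attributes(g_hat, a_hat, p_hat, l_hat, m_hat):
    """Build S_i = <g_hat, a_hat, p_hat_i, l_hat_i, m_hat_i> for every frame i.

    g_hat, a_hat: consistency-processed static attributes shared by all frames.
    p_hat, l_hat, m_hat: lists of T smoothed per-frame dynamic attributes.
    Returns the list S_1..S_T and the (m_g + m_a + 6 + 27 + m_g) x T matrix S.
    """
    S = [np.concatenate([g_hat, a_hat, p, l, m])
         for p, l, m in zip(p_hat, l_hat, m_hat)]
    return S, np.stack(S, axis=1)
```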
In the invention, the step 2 comprises the following steps:
step 2-1, constructing a self-coding network;
step 2-2, self-coding training;
and 2-3, generating a multi-attribute embedding space.
The method for constructing the self-coding network in the step 2-1 comprises the following steps:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m cell units with shared parameters; the τ-th cell unit receives the cell state C_{τ-1} and hidden state h_{τ-1} of the (τ−1)-th cell unit and the current input x_τ as its inputs, and outputs the cell state C_τ and hidden state h_τ of the current cell. The update rule of the self-coding network is as follows:

f_τ = σ(W_f · <h_{τ-1}, x_τ> + b_f)
p_τ = σ(W_p · <h_{τ-1}, x_τ> + b_p)
C̃_τ = tanh(W_C · <h_{τ-1}, x_τ> + b_C)
C_τ = f_τ * C_{τ-1} + p_τ * C̃_τ
o_τ = σ(W_o · <h_{τ-1}, x_τ> + b_o)
h_τ = o_τ * tanh(C_τ)

where <·, ·> denotes the concatenation of two vectors, σ is the sigmoid function, tanh is the hyperbolic tangent function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) are the weights and biases of four different fully-connected layers; the input of the encoder is the multivariate attribute model, and its output is one t-dimensional feature code f_attr.
The decoder has the same structure as the encoder network; its input is the t-dimensional vector f_attr and its output is a multivariate attribute model. In the update rule, the cell state C_1 and hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the inputs x_τ of the decoder at all times are set to f_attr, where τ = 1, …, m.
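A compact PyTorch sketch of such a sequence auto-encoder: an LSTM encoder compresses the T attribute vectors S_1, …, S_T into one t-dimensional code f_attr, and an LSTM decoder that receives f_attr at every time step reconstructs the sequence. Using nn.LSTM with an output projection instead of hand-written gates is a simplification of the cell equations above, not the patent's exact architecture.

```python
import torch.nn as nn

class AttributeAutoencoder(nn.Module):
    """Sequence auto-encoder for the multivariate attribute model (step 2-1)."""
    def __init__(self, attr_dim, code_dim):
        super().__init__()
        self.encoder = nn.LSTM(attr_dim, code_dim, batch_first=True)
        self.decoder = nn.LSTM(code_dim, code_dim, batch_first=True)
        self.proj = nn.Linear(code_dim, attr_dim)     # map hidden states back to attributes

    def encode(self, S):                    # S: (B, T, attr_dim)
        _, (h, _) = self.encoder(S)         # initial states default to zero vectors
        return h[-1]                        # f_attr: (B, code_dim)

    def decode(self, f_attr, T):
        x = f_attr.unsqueeze(1).expand(-1, T, -1)   # feed f_attr at every time step
        out, _ = self.decoder(x)
        return self.proj(out)               # reconstructed sequence: (B, T, attr_dim)

    def forward(self, S):
        return self.decode(self.encode(S), S.size(1))
```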
The self-coding training method in step 2-2 of the invention is as follows:
train the self-coding network of step 2-1 by stochastic gradient descent, where the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis. The reconstruction loss is computed as:

l_sigrec = Σ_{i=1}^{T} ||S_i − Ŝ_i||²

where S_i denotes the i-th vector of the multivariate attribute model and Ŝ_i denotes the i-th vector of the reconstructed multivariate attribute model;
the smoothing loss calculation method is as follows:
Figure BDA0003607550030000063
the calculation method of the multivariate attribute matrix S comprises the following steps: will rebuild much moreAll vectors in the MetaAttribute model
Figure BDA0003607550030000064
Superimposing a matrix in the column direction, [:]denotes slicing operation, [ m ]g+ma:,:]Indicates that the m < th > matrix is selecteda+mgThe elements of the row to the last row,
Figure BDA0003607550030000065
representing the second derivative in the x-direction of the matrix;
the consistency loss is calculated as follows:
Figure BDA0003607550030000066
finally, the total loss of training for the entire network is as follows:
l=λ1lsigrec2lsmoo3lconsis
wherein λ is123Is a balance factor.
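A sketch of the three loss terms; because the loss formulas above are themselves reconstructed from the surrounding description, the exact normalization (sum versus mean) and the slicing of the static rows are assumptions.

```python
import torch

def total_loss(S, S_hat, m_static, lambdas=(1.0, 1.0, 1.0)):
    """S, S_hat: (T, D) original and reconstructed attribute matrices, one frame per row.
    m_static = m_g + m_a, the number of leading static-attribute dimensions."""
    l_sigrec = ((S - S_hat) ** 2).sum(dim=1).mean()            # reconstruction loss

    dyn = S_hat[:, m_static:]                                  # dynamic-attribute part
    second_diff = dyn[2:] - 2.0 * dyn[1:-1] + dyn[:-2]         # discrete second derivative in time
    l_smooth = (second_diff ** 2).mean()                       # smoothing loss

    static = S_hat[:, :m_static]                               # static-attribute part
    l_consis = ((static - static.mean(dim=0, keepdim=True)) ** 2).mean()  # consistency loss

    l1, l2, l3 = lambdas
    return l1 * l_sigrec + l2 * l_smooth + l3 * l_consis
```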
The method for generating the multivariate attribute embedding space in step 2-3 is as follows:
generate the feature vectors f_attr for all multivariate attribute models with the encoder of step 2-1, compute the range of each coding dimension and store it in two vectors r_l and r_h, where each dimension of r_l records the minimum value of the feature vectors in that dimension and each dimension of r_h records the maximum value of the feature vectors in that dimension. The multivariate attribute embedding space consists of all vectors r satisfying:

{ r | r_{l,α} ≤ r_α ≤ r_{h,α}, α = 1, …, t }

where r_{l,α} denotes the value of the α-th dimension of r_l, r_α denotes the value of the α-th dimension of the vector r, and r_{h,α} denotes the value of the α-th dimension of r_h.
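A sketch of recording and sampling the embedding space: the per-dimension minimum and maximum of the training codes define an axis-aligned box, and new codes are drawn from inside it. Uniform sampling is an assumption; the patent only requires the sample to stay within the recorded range.

```python
import numpy as np

def build_embedding_space(codes):
    """codes: (N, t) array of f_attr vectors produced by the trained encoder.
    Returns the per-dimension bounds r_l and r_h."""
    return codes.min(axis=0), codes.max(axis=0)

def sample_embedding(r_l, r_h, rng=None):
    """Draw one vector r with r_l[a] <= r[a] <= r_h[a] for every dimension a."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.uniform(r_l, r_h)
```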
In the present invention, step 3 comprises:
step 3-1, sampling a feature vector from the multi-element attribute embedding space, and decoding the feature vector into a multi-element attribute model according to the decoder obtained by training in step 2-2
{Ŝ_i | i = 1, …, T}.
Step 3-2, generating the dynamic and static attributes: the static attribute part and the dynamic attribute part are separated from each decoded multivariate attribute vector and further split into geometry ĝ_i, texture â_i, illumination l̂_i, pose p̂_i and deformation m̂_i; the representative static attributes, namely the geometry ĝ and the texture â, are then generated.
Step 3-3, synthesizing a geometric shape: tth frameThree-dimensional grid model
The three-dimensional mesh G_t^mesh of the t-th frame is computed as follows:

G_t^mesh = G_avg + G_ne · ĝ + G_mor · m̂_t

wherein G_avg is the average geometric shape defined in the three-dimensional deformable model (3DMM); its dimension is 3n, n denotes the number of mesh vertices, and every 3 dimensions represent the x, y, z coordinates of one three-dimensional vertex; G_ne is the neutral shape basis defined in the 3DMM, with the same dimension as G_avg, where every 3 dimensions represent the offset of one vertex's coordinates; G_mor is the shape deformation basis defined in the 3DMM, with the same dimension and meaning as G_ne; the resulting G_t^mesh is a 3n-dimensional vector, which is reshaped into an n × 3 matrix in which each row represents the coordinates of one vertex of the model's three-dimensional mesh;
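A sketch of the shape synthesis; treating G_ne and G_mor as 3n × m_g basis matrices taken from a 3DMM (for example a Basel-style face model) is an assumption about how the bases are stored, and the linear-combination form follows the reconstruction of the formula above.

```python
import numpy as np

def synthesize_geometry(G_avg, G_ne, G_mor, g_hat, m_hat_t):
    """G_avg: (3n,) mean shape; G_ne: (3n, m_g) neutral shape basis;
    G_mor: (3n, m_g) deformation basis; g_hat: (m_g,) geometry code;
    m_hat_t: (m_g,) deformation code of frame t.
    Returns the frame-t mesh vertices as an (n, 3) array."""
    G_mesh = G_avg + G_ne @ g_hat + G_mor @ m_hat_t
    return G_mesh.reshape(-1, 3)
```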
step 3-4, texture synthesis illumination: firstly, synthesizing textures, and restoring a calculation formula of per-vertex color values as follows:
T^mesh = T_avg + T_ne · â

wherein T_avg is the per-vertex texture of the average geometry defined by the 3DMM, with dimension 3n, where every 3 dimensions represent the R, G, B values of one three-dimensional vertex; T_ne is the texture basis defined in the 3DMM, with the same dimension as T_avg, where every 3 dimensions represent the offset of one vertex's R, G, B values; the resulting T^mesh is a 3n-dimensional vector, which is reshaped into an n × 3 matrix in which each row represents the R, G, B values of one vertex of the model's three-dimensional mesh.
Illumination synthesis: the illumination of the t-th frame is synthesized as:

T̂_i^mesh = T_i^mesh · Σ_j l̂_{t,j} Y(j)

wherein T̂_i^mesh denotes the color value of the i-th vertex of the synthesized texture, l̂_{t,j} denotes the value of the j-th dimension of the illumination of the t-th frame, and Y(j) denotes the j-th spherical harmonic basis function; applying this operation to every vertex yields the per-vertex texture colors T̂^mesh after illumination synthesis;
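A sketch of texture synthesis followed by spherical-harmonics shading. Interpreting the 27-dimensional illumination code as 9 second-order SH coefficients per colour channel, and evaluating the SH basis at per-vertex normals of the synthesized mesh, are assumptions consistent with, but not spelled out in, the text above.

```python
import numpy as np

def sh_basis(normals):
    """Evaluate the 9 second-order spherical-harmonic basis functions at unit normals.
    normals: (n, 3) -> (n, 9)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    one = np.ones_like(x)
    return np.stack([
        0.282095 * one,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x ** 2 - y ** 2),
    ], axis=1)

def synthesize_texture_and_lighting(T_avg, T_ne, a_hat, l_hat_t, normals):
    """T_avg: (3n,) mean per-vertex albedo; T_ne: (3n, m_a) texture basis;
    a_hat: (m_a,) texture code; l_hat_t: (27,) illumination code of frame t,
    reshaped here as 9 SH coefficients per RGB channel; normals: (n, 3)."""
    albedo = (T_avg + T_ne @ a_hat).reshape(-1, 3)        # per-vertex R, G, B
    shading = sh_basis(normals) @ l_hat_t.reshape(9, 3)   # per-vertex, per-channel shading
    return albedo * shading                               # lit per-vertex texture
```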
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometric shape
G_t^mesh, the illuminated per-vertex texture T̂^mesh and the pose p̂_t are fed into a renderer to draw the t-th frame image I_t and the t-th frame mask image M_t; an image I_b of the same size as I_t is randomly cropped from another background picture, and the final video frame is synthesized as:

I = M_t ⊙ I_t + (1 − M_t) ⊙ I_b

wherein ⊙ denotes element-wise multiplication;
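A sketch of the final compositing step; the renderer itself (mesh, texture and pose in, image and mask out) is abstracted as a callable, since the patent does not fix a particular renderer, and a single-channel mask is assumed.

```python
import numpy as np

def composite_frame(render, mesh, lit_texture, pose, background):
    """render: callable (mesh, texture, pose) -> (I_t, M_t); a stand-in for the
    renderer of step 3-5. background: HxWx3 crop I_b of the same size as I_t.
    Returns the blended frame M_t * I_t + (1 - M_t) * I_b."""
    I_t, M_t = render(mesh, lit_texture, pose)       # rendered frame and its HxW mask
    M = M_t.astype(np.float32)[..., None]            # broadcast the mask over channels
    return (M * I_t + (1.0 - M) * background).astype(np.uint8)
```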
step 3-6, writing the generated T frame image into a video stream to obtain a final video sample;
and 3-7, repeating the steps 3-1 to 3-6 until the set number is reached, and finishing the generation of the video sample based on the multivariate attribute synthesis.
Beneficial effects:
A multivariate attribute model is built from the initial video data set to generate a multivariate attribute embedding space, and new samples are generated by multivariate attribute synthesis, which effectively expands the scale of the initial video data set and increases its diversity.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic process flow diagram of the present invention.
Fig. 2 is a schematic diagram of rendered pictures of several feature vectors manually picked after random sampling from the multivariate attribute embedding space.
Fig. 3 is a schematic diagram of a picture after illumination synthesis.
Fig. 4 is a mask diagram.
Fig. 5 is a schematic diagram of the generated picture.
FIG. 6 is a schematic diagram of another set of generated pictures.
Detailed Description
A video sample generation method based on multivariate attribute synthesis comprises the following steps:
step 1, constructing a multi-attribute model, wherein the method comprises the following steps: giving an initial video data set, preprocessing each video in the data set, and decomposing each video into a plurality of static attributes and dynamic attributes of a foreground object frame by frame according to a pre-trained video encoder and an attribute decomposition network; respectively carrying out consistency and smoothness processing on the multivariate static attribute and the dynamic attribute according to consistency constraint and smoothness constraint; generating a multivariate attribute model through vector splicing;
the method comprises the following steps:
step 1-1, video preprocessing, the method comprising:
detection using pre-trained targetsThe video is subjected to frame-by-frame target detection by the video detector, targets in the video are cut out according to surrounding frames obtained by detection, and finally each cut target is scaled to the size of 224 multiplied by 224 to obtain a processed video frame sequence IiI is 1, …, T, i represents the ith frame, and T represents the total number of frames in the video.
Step 1-2, decomposing video attributes; the method comprises the following steps:
utilizing a pre-trained video encoder to pre-process the video sequence I obtained in the step 1-1iEncoding into frame-by-frame feature vectors fi(ii) a The video encoder comprises a residual network of T shared parameters, whose input is a 224 x 224 three-channel image, and whose output is an n-dimensional feature vector;
decomposing video into frame-by-frame geometry f by using pre-trained attribute decomposition networki gTexture fi aPosture fi pIllumination fi lAnd deformation fi m(ii) a The attribute decomposition network comprises 5 sub-networks, including: a geometric estimation network, a texture estimation network, an attitude estimation network, an illumination estimation network and a deformation estimation network; wherein the geometric estimation network is a single-layer fully-connected layer, the input of the geometric estimation network is n-dimensional feature vectors, and the output is mgFeature vectors of dimensions for geometric estimation; the texture estimation network is a single-layer fully-connected layer, the input of the texture estimation network is n-dimensional feature vectors, and the output of the texture estimation network is maA feature vector of dimensions for texture estimation; the illumination estimation network is a single-layer full-connection layer, the input of the illumination estimation network is an n-dimensional characteristic vector, and the output of the illumination estimation network is a 27-dimensional characteristic vector for illumination estimation; the attitude estimation network is a single-layer full-connection layer, the input of the attitude estimation network is an n-dimensional characteristic vector, and the output of the attitude estimation network is a 6-dimensional characteristic vector for attitude estimation; the deformation estimation network is a single-layer full-connection layer, the input of the deformation estimation network is n-dimensional characteristic vectors, and the output of the deformation estimation network is mgFeature vectors of dimensions for deformation estimation;
step 1-3, preprocessing attributes; the method comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: consider T n-dimensional vectors { fiIf i ═ 1, …, T }, which expresses the same object, then the T n-dimensional vectors should be kept as consistent as possible, and the consistency vector f is solved using the following objective functioncon
Figure BDA0003607550030000091
The frame-by-frame geometry f obtained in step 1-1i gAnd texture fi aRespectively input into a static attribute consistency processing method to obtain geometry after consistency processing
Figure BDA0003607550030000092
And texture
Figure BDA0003607550030000093
The dynamic attribute smoothness processing method comprises the following steps: consider T n-dimensional column vectors { fiIf i 1, …, T expresses the state of the same object at T consecutive times, then the T n-dimensional vectors should change as smoothly as possible. Splicing T column vectors into an n multiplied by T matrix in the column direction according to the time sequence; the matrix is divided according to lines to obtain n vectors f 'of T dimension'jWhere j is 1, …, n, each vector is processed as follows:
f″j=kernel*f′j
where denotes a discrete convolution operation, kernel ═ 0.0545,0.244,0.403,0.244,0.0545]Denotes a one-dimensional convolution kernel, vector f'jPerforming convolution operation by using kernel in a mode of step length being 1 to obtain a smoothed result f ″j(ii) a Smoothing all n-dimensional vectors in the above manner, splicing the vectors back into an n multiplied by T matrix according to the sequence of rows, and splitting the matrix according to columns to obtain T smoothed n-dimensional column vectors; the frame-by-frame attitude f obtained in the step 1-1i pLight irradiation fi lAnd deformation fi mInputting a dynamic attribute smoothness processing method to obtain a smoothed attitude
Figure BDA0003607550030000101
Illumination of light
Figure BDA0003607550030000102
And deformation of
Figure BDA0003607550030000103
Step 1-4, multivariate attribute vector splicing; the method is as follows: the geometry ĝ, texture â, pose p̂_i, illumination l̂_i and deformation m̂_i obtained in step 1-3 are spliced to construct the multivariate attribute model:

S_i = <ĝ, â, p̂_i, l̂_i, m̂_i>

wherein <·, ·, …, ·> denotes the splicing operation of several vectors and S_i denotes the spliced multivariate attribute vector of the i-th frame.
And 1-5, repeating the steps 1-1 to 1-3 for each video in the given initial video data set until each video in the video data set is processed, and completing the construction of the multivariate attribute model.
Step 2, generating a multi-attribute embedding space, wherein the method comprises the following steps: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in the step 1 to a low-dimensional embedding space through an encoder in a self-coding network; a decoder in the self-coding network is restored into a multi-element attribute model, and the self-coding network is trained through constrained reconstruction loss; generating a multi-attribute embedding space by recording the numerical value range of the low-dimensional embedding space; the method comprises the following steps:
step 2-1, constructing a self-coding network; the method comprises the following steps:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m cell units with shared parameters; the τ-th cell unit receives the cell state C_{τ-1} and hidden state h_{τ-1} of the (τ−1)-th cell unit and the current input x_τ as its inputs, and outputs the cell state C_τ and hidden state h_τ of the current cell. The update rule of the self-coding network is as follows:

f_τ = σ(W_f · <h_{τ-1}, x_τ> + b_f)
p_τ = σ(W_p · <h_{τ-1}, x_τ> + b_p)
C̃_τ = tanh(W_C · <h_{τ-1}, x_τ> + b_C)
C_τ = f_τ * C_{τ-1} + p_τ * C̃_τ
o_τ = σ(W_o · <h_{τ-1}, x_τ> + b_o)
h_τ = o_τ * tanh(C_τ)

where <·, ·> denotes the concatenation of two vectors, σ is the sigmoid function, tanh is the hyperbolic tangent function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) are the weights and biases of four different fully-connected layers; the input of the encoder is the multivariate attribute model, and its output is one t-dimensional feature code f_attr.
The decoder has the same structure as the encoder network; its input is the t-dimensional vector f_attr and its output is a multivariate attribute model. In the update rule, the cell state C_1 and hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the inputs x_τ of the decoder at all times are set to f_attr, where τ = 1, …, m.
Step 2-2, self-coding training; the method comprises the following steps:
training the self-coding network of step 2-1 by stochastic gradient descent, where the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis; the reconstruction loss is computed as:
l_sigrec = Σ_{i=1}^{T} ||S_i − Ŝ_i||²

where S_i denotes the i-th vector of the multivariate attribute model and Ŝ_i denotes the i-th vector of the reconstructed multivariate attribute model.
The smoothing loss is computed as:

l_smooth = || ∂²(Ŝ[m_g+m_a:, :]) / ∂x² ||²

where the multivariate attribute matrix Ŝ is obtained by stacking all vectors Ŝ_i of the reconstructed multivariate attribute model in the column direction, [:, :] denotes the slicing operation, [m_g+m_a:, :] selects the elements from row m_g+m_a to the last row, and ∂²/∂x² denotes the second derivative of the matrix in the x (time) direction.
The consistency loss is computed as:

l_consis = Σ_{i=1}^{T} || Ŝ_i[:m_g+m_a] − (1/T) Σ_{k=1}^{T} Ŝ_k[:m_g+m_a] ||²

Finally, the total training loss of the whole network is:

l = λ_1 l_sigrec + λ_2 l_smooth + λ_3 l_consis

where λ_1, λ_2, λ_3 are balance factors.
And 2-3, generating a multi-attribute embedding space. The method comprises the following steps:
feature vector generation using the encoder in step 2-1 for all multivariate attribute models
(denoted f_attr); the coding range of each dimension is calculated and stored in two vectors r_l and r_h, where each dimension of r_l records the minimum value of the feature vectors in that dimension and each dimension of r_h records the maximum value of the feature vectors in that dimension; the multivariate attribute embedding space consists of all vectors r satisfying { r | r_{l,α} ≤ r_α ≤ r_{h,α}, α = 1, …, t }, where r_{l,α}, r_α and r_{h,α} denote the value of the α-th dimension of r_l, r and r_h, respectively.
Step 3, synthesizing the multi-element attribute, wherein the method comprises the following steps: sampling from the multi-element attribute embedded space, and generating the multi-element attribute of each frame by using the decoder trained in the step 2; calculating the three-dimensional geometry and the texture of each frame of target object by using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the multi-attribute synthesis process to obtain a plurality of specified video samples, and finally finishing the generation of the video samples synthesized based on the multi-attribute.
Step 3-1, sampling a feature vector from the multi-element attribute embedding space, and decoding the feature vector into a multi-element attribute model according to the decoder obtained by training in step 2-2
{Ŝ_i | i = 1, …, T}.
Step 3-2, generating the dynamic and static attributes: the static attribute part and the dynamic attribute part are separated from each decoded multivariate attribute vector and further split into geometry ĝ_i, texture â_i, illumination l̂_i, pose p̂_i and deformation m̂_i; the representative static attributes, namely the geometry ĝ and the texture â, are then generated.
Step 3-3, synthesizing a geometric shape: t frame three-dimensional grid model
G_t^mesh is computed as follows:

G_t^mesh = G_avg + G_ne · ĝ + G_mor · m̂_t

wherein G_avg is the average geometric shape defined in the three-dimensional deformable model (3DMM); its dimension is 3n, n denotes the number of mesh vertices, and every 3 dimensions represent the x, y, z coordinates of one three-dimensional vertex; G_ne is the neutral shape basis defined in the 3DMM, with the same dimension as G_avg, where every 3 dimensions represent the offset of one vertex's coordinates; G_mor is the shape deformation basis defined in the 3DMM, with the same dimension and meaning as G_ne; the resulting G_t^mesh is a 3n-dimensional vector, which is reshaped into an n × 3 matrix in which each row represents the coordinates of one vertex of the model's three-dimensional mesh;
step 3-4, synthesizing texture illumination: firstly, synthesizing textures, and restoring a calculation formula of per-vertex color values as follows:
T^mesh = T_avg + T_ne · â

wherein T_avg is the per-vertex texture of the average geometry defined by the 3DMM, with dimension 3n, where every 3 dimensions represent the R, G, B values of one three-dimensional vertex; T_ne is the texture basis defined in the 3DMM, with the same dimension as T_avg, where every 3 dimensions represent the offset of one vertex's R, G, B values; the resulting T^mesh is a 3n-dimensional vector, which is reshaped into an n × 3 matrix in which each row represents the R, G, B values of one vertex of the model's three-dimensional mesh.
Illumination synthesis: the illumination of the t-th frame is synthesized as:

T̂_i^mesh = T_i^mesh · Σ_j l̂_{t,j} Y(j)

wherein T̂_i^mesh denotes the color value of the i-th vertex of the synthesized texture, l̂_{t,j} denotes the value of the j-th dimension of the illumination of the t-th frame, and Y(j) denotes the j-th spherical harmonic basis function; applying this operation to every vertex yields the per-vertex texture colors T̂^mesh after illumination synthesis;
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometric shape
G_t^mesh, the illuminated per-vertex texture T̂^mesh and the pose p̂_t are fed into a renderer to draw the t-th frame image I_t and the t-th frame mask image M_t; an image I_b of the same size as I_t is randomly cropped from another background picture, and the final video frame is synthesized as:

I = M_t ⊙ I_t + (1 − M_t) ⊙ I_b

wherein ⊙ denotes element-wise multiplication;
3-6, writing the generated T frame image into a video stream to obtain a final video sample;
and 3-7, repeating the steps 3-1 to 3-6 until the set number is reached, and finishing the generation of the video sample based on the multivariate attribute synthesis.
Examples
As shown in fig. 1, the method for generating a video sample based on multivariate attribute synthesis disclosed in the present invention is specifically implemented according to the following steps:
step 1, constructing a multivariate attribute model: firstly, decomposing multi-element static attributes (geometry, texture) and dynamic attributes (posture, illumination and deformation) of a foreground object frame by frame from a video according to a pre-trained video encoder and an attribute decomposition network, and then respectively carrying out consistency and smoothness processing on the static attributes and the dynamic attributes according to consistency constraint and smoothness constraint; and finally generating a multi-element attribute model through vector splicing.
Step 2, generating a multi-attribute embedding space: and (3) constructing a self-encoder based on a neural network, mapping the multi-element attribute model constructed in the step (1) to a low-dimensional embedding space through the encoder of the self-encoding network, reducing the multi-element attribute model into the multi-element attribute model through a decoder, and training the self-encoding network through constraint reconstruction loss. And generating the multi-attribute embedding space by recording the value range of the low-dimensional embedding space.
Step 3, multi-element attribute synthesis: sampling from a multi-element attribute embedding space, generating the multi-element attribute of each frame by utilizing a trained decoder, calculating the three-dimensional geometry and the texture of a target object of each frame by utilizing a shape synthesis method and an illumination synthesis method, and finally rendering to generate a video sample. Repeating the multivariate attribute synthesis process may result in multiple video samples specified by the user.
The main flow of each step is described in detail below.
Wherein, step 1 includes the following steps:
step 1.1, decomposing video attributes: video I with pre-trained video encoder i1, …, T is encoded as a frame-by-frame feature vector { fiL 1, …, T, decomposing the frame-by-frame feature vector by using the attribute decomposition network to obtain the frame-by-frame geometry { f i g1, …, T }, texture { f |i a1, …, T, pose { f |i p1, …, T }, illumination { f |i l1, …, T |, and the deformation { f |i m|i,…,T}。
Step 1.2, attribute preprocessing: considering T n-dimensional vectors {f_i | i = 1, …, T} that express the same object, the T vectors should be kept as consistent as possible, so static attribute consistency processing is performed on the frame-by-frame geometry and texture to obtain the consistency-processed geometry
ĝ and texture â. Considering T n-dimensional column vectors {f_i | i = 1, …, T} that express the states of the same object at T consecutive times, the T vectors should change as smoothly as possible, so the frame-by-frame pose, illumination and deformation are smoothed to obtain the smoothed pose p̂_i, illumination l̂_i and deformation m̂_i.
Step 1.3, multi-element attribute vector splicing, for the geometry obtained from step 1.2
ĝ, texture â, pose p̂_i, illumination l̂_i and deformation m̂_i, the multivariate attribute model S_i = <ĝ, â, p̂_i, l̂_i, m̂_i> is obtained by the vector splicing operation.
And 1.4, repeating the steps 1-1 to 1-3 for each video in the original video data set until each video in the video data set is processed.
The step 2 comprises the following steps:
step 2.1, constructing a self-coding network, wherein the encoder comprises m cell units, the m cell units share parameters, and the ith cell unit receives the cell state C of the (i-1) th cell uniti-1Hidden state hi-1And current time input xiAs input, and outputting the cell state C of the current celliAnd a hidden state hiThe decoder has a structure consistent with a dynamic mode coding network, and is different in that the input of the decoder is a t-dimensional vector fattrThe output is a multivariate attribute model. In the update rule, the difference is the cell state C of the encoder at time 11And hidden state h1Set to the full 0 vector, input x at all times of the decoderiI is 1, …, m is set as fattr
And 2.2, training a self-coding network, wherein the self-coding network constructed in the step 2.1 is trained by adopting a back propagation and random gradient descent method, and the reconstruction loss, the smoothing loss and the consistency loss are minimized.
Step 2.3, generating a multi-element attribute embedding space, and generating feature vectors for all multi-element attribute models by using the encoder in the step 2-1
to obtain the feature vectors f_attr. The coding range of each dimension is calculated and stored in the two vectors r_l and r_h, where each dimension of r_l records the minimum value of the feature vectors in that dimension and each dimension of r_h records the maximum value of the feature vectors in that dimension.
Step 3 comprises the following steps
Step 3.1, sampling a feature vector from the multi-element attribute embedding space, and decoding the feature vector into a multi-element attribute model according to the decoder obtained by training in the step 2.2
{Ŝ_i | i = 1, …, T}.
Step 3.2, generating the dynamic and static attributes: the static attribute part and the dynamic attribute part are separated from the decoded multivariate attribute model and further split into geometry ĝ_i, texture â_i, illumination l̂_i, pose p̂_i and deformation m̂_i; the representative static attributes, namely the geometry ĝ and the texture â, are then generated.
And 3.3, synthesizing a geometric shape, and synthesizing the geometric shape and the deformation into a three-dimensional shape according to the 3DMM model.
Step 3.4, texture illumination is synthesized, the texture is synthesized into vertex-by-vertex color values according to the 3DMM model, and the vertex-by-vertex texture colors after the synthesized illumination are obtained according to the spherical harmonic illumination formula
Step 3.5, rendering, namely, synthesizing the t-th frame three-dimensional geometric shape
G_t^mesh; the synthesized geometry, the illuminated per-vertex texture T̂^mesh and the pose p̂_t are fed into the renderer to draw the t-th frame rendered image I_t and the t-th frame mask image M_t; a background image I_b of the same size as I_t is randomly cropped from another background picture, and the rendered image and the background image are blended according to the mask image.
And 3.6, writing the generated T frame image into a video stream to obtain a final video sample.
And 3.7, repeating the steps 3.1 to 3.6 until the number of samples required by the user is met.
In this embodiment, a human face is taken as an example for explanation. The specific implementation process is as follows:
in the step 1, the video data set mainly adopts the video data set in the documents Nagrani A, Chung J S, Zisserman A.Voxceleb: a large-scale marker identification dataset [ J ]. arXIv prediction arXIv:1706.08612,2017. Then step 1.1 is performed for attribute decomposition. The residual network in step 1.1 is constructed in the manner described in the document He K, Zhang X, Ren S, et al. deep residual learning for image registration [ C ]// Proceedings of the IEEE reference on computer vision and pattern registration.2016: 770-. Video coding networks and attribute decomposition networks were pre-trained on the CelebA dataset in the manner described in the documents Deng Y, Yang J, Xu S, et al, accurate 3d face retrieval with well-featured learning From single image to image set [ C ]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition works.2019: 0-0. The multivariate property model is then constructed according to steps 1.2 and 1.3.
In step 2, the self-coding network is trained on the multivariate attribute models obtained in step 1, and the multivariate attribute embedding space is generated. The stochastic gradient descent method described in step 2.2 is carried out in the manner described in Bottou L. Stochastic gradient descent tricks [M]// Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012: 421-436.
To illustrate some intermediate results in step 3, the present invention is first shown by taking the generation of a single frame as an example. Sampling the multi-attribute embedded space generated in the second step, generating a representative static attribute according to the steps 3.1 and 3.2, setting deformation and illumination to be a full 0 vector, obtaining a human face three-dimensional model and vertex-by-vertex color values by using the steps 3.3 and 3.4, rendering the result of multiple sampling by using the rendering step of the step 3.5 to obtain a result shown in the figure 2, then generating illumination according to the steps 3.1 and 3.2, re-synthesizing the vertex-by-vertex color values according to the step 3.4, and obtaining the result shown in the figure 3 by using the rendering of the step 3.5. Fig. 4 is the corresponding mask image. Wherein the white area represents a face area and the black area represents a background area.
Next, the invention shows how to generate multiple video frames. First, the multivariate attribute embedding space generated in step 2 is sampled, and representative static attributes and T dynamic attributes are generated according to steps 3.1 and 3.2; the rendered image of each frame is then obtained from the static attributes and the dynamic attributes of each moment according to steps 3.3, 3.4 and 3.5. Since the number of rendered video frames is large, one frame is taken every 3 frames and the frames are stitched to obtain the image shown in Fig. 5. The multivariate attribute embedding space generated in step 2 is then sampled again, representative static attributes and T dynamic attributes are generated according to steps 3.1 and 3.2, the representative attributes of this second sampling are replaced by the representative attributes obtained from the first sampling, the rendered image of each frame is obtained according to steps 3.3, 3.4 and 3.5, one frame is taken every 3 frames, and the frames are stitched to obtain the image shown in Fig. 6. Fig. 6 shows a new video segment whose frames are sampled at an interval of 3. The identity of the person in Fig. 6 is consistent with that in Fig. 5; the differences are the illumination, deformation and pose.
The 3DMM model used in steps 3.3 and 3.4 is the model described in Paysan P, Knothe R, Amberg B, et al. A 3D face model for pose and illumination invariant face recognition [C]// 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2009: 296-301.
The present invention provides a video sample generation method based on multivariate attribute synthesis. There are many methods and approaches for implementing this technical solution, and the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be realized by the prior art.

Claims (10)

1. A video sample generation method based on multivariate attribute synthesis is characterized by comprising the following steps:
step 1, constructing a multi-attribute model, wherein the method comprises the following steps: giving an initial video data set, preprocessing each video in the data set, and decomposing each video into a plurality of static attributes and dynamic attributes of a foreground object frame by frame according to a pre-trained video encoder and an attribute decomposition network; respectively carrying out consistency and smoothness processing on the multivariate static attribute and the dynamic attribute according to consistency constraint and smoothness constraint; generating a multivariate attribute model through vector splicing;
step 2, generating a multi-attribute embedding space, wherein the method comprises the following steps: constructing a self-coding network based on a neural network; mapping the multivariate attribute model constructed in the step 1 to a low-dimensional embedding space through an encoder in a self-coding network; a decoder in the self-coding network is restored into a multi-attribute model, and the self-coding network is trained through constrained reconstruction loss; generating a multi-attribute embedding space by recording the numerical value range of the low-dimensional embedding space;
step 3, synthesizing the multivariate attribute, wherein the method comprises the following steps: sampling from the multi-element attribute embedded space, and generating the multi-element attribute of each frame by using the decoder trained in the step 2; calculating the three-dimensional geometry and the texture of each frame of target object by using a shape synthesis method and an illumination synthesis method; rendering to generate a video sample; and repeating the process of synthesizing the multi-attribute to obtain a plurality of appointed video samples, and finally finishing the generation of the video samples synthesized based on the multi-attribute.
2. The method for generating a video sample based on multivariate attribute synthesis as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1-1, video preprocessing, the method comprising:
utilizing a pre-trained target detector to carry out frame-by-frame target detection on the video, cutting out the targets in the video according to the detected bounding boxes, and finally scaling each cut target to a size of 224 × 224 to obtain a processed video frame sequence I_i, i = 1, ..., T, where i denotes the ith frame and T denotes the total number of frames in the video;
step 1-2, decomposing video attributes;
step 1-3, preprocessing attributes;
step 1-4, splicing the multivariate attribute vectors;
and step 1-5, repeating steps 1-1 to 1-3 for each video in the given initial video data set until every video in the data set has been processed, thereby completing the construction of the multivariate attribute model.
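The preprocessing of step 1-1 (frame-by-frame detection, cropping and rescaling to 224 × 224) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the detector itself is abstracted away, and `boxes[i]` is assumed to already hold the (x1, y1, x2, y2) bounding box of the target in frame i.

```python
import cv2
import numpy as np

def preprocess_frames(frames, boxes, size=224):
    """Return the processed frame sequence I_i, i = 1..T."""
    processed = []
    for frame, (x1, y1, x2, y2) in zip(frames, boxes):
        crop = frame[y1:y2, x1:x2]                        # cut out the detected target
        processed.append(cv2.resize(crop, (size, size)))  # scale to 224 x 224
    return processed  # list of T images of shape (224, 224, 3)
```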
3. The method according to claim 2, wherein the video attribute decomposition method in step 1-2 comprises:
encoding the preprocessed video sequence I_i obtained in step 1-1 into frame-by-frame feature vectors f_i by using a pre-trained video encoder; the video encoder comprises T parameter-sharing residual networks, each of which takes a 224 × 224 three-channel image as input and outputs an n-dimensional feature vector;
decomposing the video frame by frame into geometry f_i^g, texture f_i^a, pose f_i^p, illumination f_i^l and deformation f_i^m by using a pre-trained attribute decomposition network; the attribute decomposition network comprises 5 sub-networks: a geometry estimation network, a texture estimation network, a pose estimation network, an illumination estimation network and a deformation estimation network; the geometry estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for geometry estimation; the texture estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_a-dimensional feature vector for texture estimation; the illumination estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 27-dimensional feature vector for illumination estimation; the pose estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is a 6-dimensional feature vector for pose estimation; the deformation estimation network is a single fully-connected layer whose input is the n-dimensional feature vector and whose output is an m_g-dimensional feature vector for deformation estimation.
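The attribute decomposition network of claim 3 (five single fully-connected heads over an n-dimensional frame feature) can be sketched as follows. The concrete sizes n, m_g and m_a are illustrative assumptions; the claim leaves them symbolic.

```python
import torch
import torch.nn as nn

class AttributeDecomposition(nn.Module):
    def __init__(self, n=512, m_g=80, m_a=80):
        super().__init__()
        self.geometry     = nn.Linear(n, m_g)   # m_g-dim geometry code
        self.texture      = nn.Linear(n, m_a)   # m_a-dim texture code
        self.illumination = nn.Linear(n, 27)    # 27-dim illumination code
        self.pose         = nn.Linear(n, 6)     # 6-dim pose code
        self.deformation  = nn.Linear(n, m_g)   # m_g-dim deformation code

    def forward(self, f):                       # f: (T, n) frame features
        return {
            "geometry": self.geometry(f),
            "texture": self.texture(f),
            "illumination": self.illumination(f),
            "pose": self.pose(f),
            "deformation": self.deformation(f),
        }
```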
4. The method for generating a video sample based on multivariate attribute synthesis as claimed in claim 3, wherein the attribute preprocessing method in step 1-3 comprises static attribute consistency processing and dynamic attribute smoothness processing;
the static attribute consistency processing method comprises the following steps: solving the consistency vector f by using the following objective functioncon
Figure FDA0003607550020000021
The frame-by-frame geometry f obtained in step 1-1i gAnd texture fi aRespectively input into a static attribute consistency processing method to obtain the geometry after consistency processing
Figure FDA0003607550020000022
And texture
Figure FDA0003607550020000023
the dynamic attribute smoothness processing method comprises: splicing the T column vectors into an n × T matrix in the column direction according to the time sequence; splitting the matrix by rows to obtain n T-dimensional vectors f'_j, where j = 1, ..., n; and processing each vector as follows:
f''_j = kernel * f'_j
where * denotes the discrete convolution operation and kernel = [0.0545, 0.244, 0.403, 0.244, 0.0545] is a one-dimensional convolution kernel; the vector f'_j is convolved with kernel with a step size of 1 to obtain the smoothed result f''_j; after all n T-dimensional vectors are smoothed in this way, they are spliced back into an n × T matrix in row order, and the matrix is split by columns to obtain T smoothed n-dimensional column vectors; the frame-by-frame pose f_i^p, illumination f_i^l and deformation f_i^m obtained in step 1-2 are input into the dynamic attribute smoothness processing method to obtain the smoothed pose f̄_i^p, illumination f̄_i^l and deformation f̄_i^m.
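The dynamic attribute smoothness processing of claim 4 amounts to a row-wise one-dimensional convolution with a fixed 5-tap kernel. A minimal sketch, assuming 'same' boundary padding (the claim specifies only a step size of 1, so the boundary handling is an assumption):

```python
import numpy as np

KERNEL = np.array([0.0545, 0.244, 0.403, 0.244, 0.0545])

def smooth_dynamic_attribute(vectors):
    """vectors: list of T n-dimensional column vectors (e.g. per-frame poses)."""
    M = np.stack(vectors, axis=1)                              # n x T matrix
    smoothed = np.stack(
        [np.convolve(row, KERNEL, mode="same") for row in M]   # row-wise convolution
    )
    return [smoothed[:, t] for t in range(smoothed.shape[1])]  # back to T columns
```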
5. The method according to claim 4, wherein the multivariate attribute vector splicing method in steps 1-4 comprises: performing a vector splicing operation on the consistency-processed geometry f̄^g and texture f̄^a and the smoothed pose f̄_i^p, illumination f̄_i^l and deformation f̄_i^m obtained in step 1-3 to construct the multivariate attribute model:
S_i = < f̄^g, f̄^a, f̄_i^p, f̄_i^l, f̄_i^m >
where < ·, ·, ..., · > represents the splicing operation of a plurality of vectors and S_i represents the result of splicing the multivariate attribute vectors of the ith frame.
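The splicing of steps 1-4 is plain vector concatenation; a one-line sketch, with the attribute order taken from the claim text above:

```python
import numpy as np

def splice_attributes(geometry, texture, pose_i, illum_i, deform_i):
    """Return S_i = <geometry, texture, pose_i, illum_i, deform_i>."""
    return np.concatenate([geometry, texture, pose_i, illum_i, deform_i])
```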
6. The method according to claim 5, wherein step 2 comprises:
step 2-1, constructing a self-coding network;
step 2-2, self-coding training;
and 2-3, generating a multi-attribute embedding space.
7. The method according to claim 6, wherein the method for constructing the self-coding network in step 2-1 comprises:
constructing a self-coding network based on a neural network, wherein the self-coding network comprises an encoder and a decoder;
wherein the encoder comprises m parameter-sharing cell units; the τth cell unit receives the cell state C_{τ-1} and the hidden state h_{τ-1} of the (τ-1)th cell unit and the input x_τ at the current moment as inputs, and outputs the cell state C_τ and the hidden state h_τ of the current cell; the update rule of the self-coding network is as follows:
f_τ = σ(W_f · <h_{τ-1}, x_τ> + b_f)
p_τ = σ(W_p · <h_{τ-1}, x_τ> + b_p)
C̃_τ = tanh(W_C · <h_{τ-1}, x_τ> + b_C)
C_τ = f_τ * C_{τ-1} + p_τ * C̃_τ
o_τ = σ(W_o · <h_{τ-1}, x_τ> + b_o)
h_τ = o_τ * tanh(C_τ)
where <·,·> represents the splicing operation of two vectors, σ is the sigmoid function, tanh is the hyperbolic tangent function, and (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o) respectively represent the weights and biases of four different fully-connected layers; the input of the encoder is the multivariate attribute model, and the output is one t-dimensional feature code f_attr;
the decoder adopts the same cell-unit structure; the input to the decoder is the t-dimensional vector f_attr, and the output is the multivariate attribute model; in the update rule, the cell state C_1 and the hidden state h_1 of the encoder at time 1 are set to all-zero vectors, and the input x_τ of the decoder at every time step is set to f_attr, where τ = 1, ..., m.
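The cell unit of claim 7 follows the update rule above, which matches a standard LSTM-style cell built from the four gates (W_f, b_f), (W_p, b_p), (W_C, b_C), (W_o, b_o). A minimal sketch of one parameter-shared cell; the layer sizes are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class CellUnit(nn.Module):
    def __init__(self, x_dim, h_dim):
        super().__init__()
        cat = x_dim + h_dim
        self.Wf = nn.Linear(cat, h_dim)   # forget gate  (W_f, b_f)
        self.Wp = nn.Linear(cat, h_dim)   # input gate   (W_p, b_p)
        self.Wc = nn.Linear(cat, h_dim)   # candidate    (W_C, b_C)
        self.Wo = nn.Linear(cat, h_dim)   # output gate  (W_o, b_o)

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([h_prev, x_t], dim=-1)           # <h_{tau-1}, x_tau>
        f = torch.sigmoid(self.Wf(z))
        p = torch.sigmoid(self.Wp(z))
        c_tilde = torch.tanh(self.Wc(z))
        c = f * c_prev + p * c_tilde                   # new cell state C_tau
        h = torch.sigmoid(self.Wo(z)) * torch.tanh(c)  # hidden state h_tau
        return h, c
```

In the encoder the cell is unrolled over the m attribute vectors starting from zero states; in the decoder the same cell is unrolled with x_τ fixed to f_attr at every step, as described in the claim.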
8. The method according to claim 7, wherein the self-coding training method in step 2-2 comprises:
training the self-coding network in step 2-1 by a stochastic gradient descent method, wherein the loss function comprises three parts: a reconstruction loss l_sigrec, a smoothing loss l_smooth and a consistency loss l_consis;
the reconstruction loss penalizes the discrepancy between each S_i and its reconstruction (its formula is given as an image in the original claim), where S_i represents the ith vector of the multivariate attribute model and Ŝ_i represents the ith vector of the reconstructed multivariate attribute model;
the smoothing loss is computed by the formula given as an image in the original claim; the multivariate attribute matrix S used therein is obtained by stacking all vectors Ŝ_i of the reconstructed multivariate attribute model in the column direction into a matrix, [:, :] denotes the slicing operation, [m_g+m_a :, :] indicates selecting the elements from the (m_a+m_g)th row to the last row, and ∂²/∂x² represents the second derivative of the matrix in the x direction;
the consistency loss is computed by the formula given as an image in the original claim;
finally, the total training loss of the whole network is:
l = λ_1 l_sigrec + λ_2 l_smooth + λ_3 l_consis
where λ_1, λ_2 and λ_3 are balance factors.
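Because the three loss terms appear only as formula images in the original claim, the sketch below uses plausible readings of the surrounding text: a squared reconstruction error, a second-difference penalty along the frame axis for the dynamic rows, and a per-frame deviation penalty for the static rows. It is an assumption-laden illustration, not a transcription of the patented loss.

```python
import torch

def total_loss(S, S_hat, m_g, m_a, lambdas=(1.0, 1.0, 1.0)):
    """S, S_hat: (T, d) original and reconstructed multivariate attribute models.
    All three terms are assumed forms consistent with the claim text."""
    l_sigrec = ((S - S_hat) ** 2).sum(dim=1).mean()            # reconstruction error

    dynamic = S_hat[:, m_g + m_a:]                             # rows after m_g + m_a
    l_smooth = ((dynamic[2:] - 2 * dynamic[1:-1] + dynamic[:-2]) ** 2).mean()

    static = S_hat[:, :m_g + m_a]                              # static (geometry + texture) part
    l_consis = ((static - static.mean(dim=0, keepdim=True)) ** 2).mean()

    l1, l2, l3 = lambdas
    return l1 * l_sigrec + l2 * l_smooth + l3 * l_consis
```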
9. The method according to claim 8, wherein the method for generating the multivariate attribute embedding space in step 2-3 comprises:
generating feature codes for all multivariate attribute models by using the encoder in step 2-1; computing the coding range of each dimension and storing it in two vectors r_l and r_h, where each dimension of r_l records the minimum value of the feature codes in that dimension and each dimension of r_h records the maximum value of the feature codes in that dimension; the multivariate attribute embedding space consists of all vectors r satisfying { r | r_{l,α} ≤ r_α ≤ r_{h,α}, α = 1, ..., t }, where r_{l,α} represents the value of the αth dimension of r_l, r_α represents the value of the αth dimension of r, and r_{h,α} represents the value of the αth dimension of r_h.
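A sketch of building the embedding space of claim 9 and drawing a sample from it. Uniform sampling inside the [r_l, r_h] box is an assumption; the claim only defines which vectors belong to the space.

```python
import numpy as np

def embedding_space(codes):
    """codes: array of shape (N, t), one t-dimensional feature code per model."""
    r_l = codes.min(axis=0)          # per-dimension minimum
    r_h = codes.max(axis=0)          # per-dimension maximum
    return r_l, r_h

def sample_code(r_l, r_h, rng=None):
    """Draw one vector r with r_l[a] <= r[a] <= r_h[a] in every dimension a."""
    rng = rng or np.random.default_rng()
    return rng.uniform(r_l, r_h)     # uniform sampling is an assumption
```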
10. The method according to claim 9, wherein step 3 comprises:
step 3-1, sampling a feature vector from the multi-element attribute embedding space, and decoding the feature vector into a multi-element attribute model according to the decoder obtained by training in step 2-2
Figure FDA0003607550020000051
Figure FDA0003607550020000052
Step 3-2, generating dynamic attributes and static attributes: resolving static attribute part from multi-attribute model obtained by decoding
Figure FDA0003607550020000053
And dynamic Properties section
Figure FDA0003607550020000054
Further splitting it into geometries
Figure FDA0003607550020000055
Texture
Figure FDA0003607550020000056
Illumination of light
Figure FDA0003607550020000057
Posture
Figure FDA0003607550020000058
And deformation of
Figure FDA0003607550020000059
Generating representative static attributes, geometries
Figure FDA00036075500200000510
Texture
Figure FDA00036075500200000511
Step 3-3, synthesizing a geometric shape: t frame three-dimensional grid model
Figure FDA00036075500200000512
The calculation method of (2) is as follows:
Figure FDA00036075500200000513
wherein G isavgThe method is an average geometric shape defined in a three-dimensional deformable model 3DMM, the dimensionality of the average geometric shape is 3n, and n represents the number of grid vertexes; each 3 dimensions represent x, y and z coordinate values of a three-dimensional vertex; gneIs a neutral shape group as defined in 3DMM and has the same dimension as GavgEach 3 dimensions represents the offset of a three-dimensional vertex coordinate value; gmorIs a deformation base of the shape defined in 3DMM, the dimension and meaning are the same as Gne(ii) a G obtained finallymeshThe vector with 3n dimensionality is normalized into an n multiplied by 3 matrix through the heavy shape, wherein each row of the matrix represents a coordinate value of a model three-dimensional grid vertex;
step 3-4, texture synthesis illumination: firstly, synthesizing textures, and restoring a calculation formula of per-vertex color values as follows:
Figure FDA00036075500200000514
wherein, TavgA vertex-by-vertex texture of average geometry defined by 3DMM with dimensions 3n representing the R, G, B values of a three-dimensional vertex every 3 dimensions; t is a unit ofneIs a texture base defined in 3DMM and has the same dimension as TavgEach 3 dimensions represent the offset of R, G and B values of a three-dimensional vertex; the final TmeshIs a vector of dimension 3n, which is normalized by reshaping it into an n x 3 matrix, where each row of the matrix is represented by a weightRepresenting the R, G and B values of the vertexes of a three-dimensional mesh of the model;
and (3) synthesizing illumination: frame t illumination
Figure FDA0003607550020000061
The synthesis method comprises the following steps:
Figure FDA0003607550020000062
wherein the content of the first and second substances,
Figure FDA0003607550020000063
a color value representing the ith vertex of the synthesized texture,
Figure FDA0003607550020000064
a value representing the jth dimension of the illumination of the tth frame, and Y (j) represents the jth base of the spherical harmonic; the operation of each vertex is as above to obtain the color of the synthesized illumination per-vertex texture
Figure FDA0003607550020000065
Step 3-5, rendering: the synthesized t-th frame three-dimensional geometric shape
Figure FDA0003607550020000066
Synthesized illuminated per-vertex texture
Figure FDA0003607550020000067
And attitude
Figure FDA0003607550020000068
Sending into a renderer to draw a t frame image ItAnd mask image M of t-th frametRandomly cropping from another background picture and ItImages I of the same sizebAnd synthesizing the final video frame, wherein the synthesizing method comprises the following steps:
Figure FDA0003607550020000069
wherein |, indicates element-by-element multiplication;
step 3-6, writing the generated T frame image into a video stream to obtain a final video sample;
and 3-7, repeating the steps 3-1 to 3-6 until the set number is reached, and finishing the generation of the video sample based on the multivariate attribute synthesis.
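Steps 3-3 to 3-5 of claim 10 combine linear 3DMM synthesis, spherical-harmonics shading and mask-based compositing. The sketch below follows that flow under stated assumptions: the 3DMM bases (G_avg, G_ne, G_mor, T_avg, T_ne), the per-vertex spherical-harmonics basis values and the renderer outputs are taken as given, and the 3 × 9 layout of the 27-dimensional illumination code is an assumption rather than something fixed by the claim.

```python
import numpy as np

def synthesize_geometry(G_avg, G_ne, G_mor, f_g, f_m_t):
    """G_mesh = G_avg + G_ne @ f_g + G_mor @ f_m_t, reshaped to (n, 3)."""
    g = G_avg + G_ne @ f_g + G_mor @ f_m_t        # 3n-dimensional vector
    return g.reshape(-1, 3)                       # one xyz row per vertex

def synthesize_texture(T_avg, T_ne, f_a):
    """T_mesh = T_avg + T_ne @ f_a, reshaped to (n, 3) RGB rows."""
    return (T_avg + T_ne @ f_a).reshape(-1, 3)

def shade_vertices(T_mesh, f_l_t, sh_basis):
    """Per-vertex SH shading: colour_i * sum_j f_l[j] * Y_j.
    sh_basis: (n, 9) spherical-harmonics basis values per vertex (assumed given).
    f_l_t: 27-dim illumination code, assumed laid out as 9 coefficients per RGB channel."""
    coeffs = f_l_t.reshape(3, 9)
    shading = sh_basis @ coeffs.T                 # (n, 3) shading per vertex
    return T_mesh * shading

def composite(I_t, M_t, I_b):
    """Final frame: rendered pixels where the mask is on, background elsewhere.
    M_t must be broadcastable against I_t (e.g. H x W x 1 with values in [0, 1])."""
    return M_t * I_t + (1.0 - M_t) * I_b
```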
CN202210423708.2A 2022-04-21 2022-04-21 Video sample generation method based on multivariate attribute synthesis Pending CN114694081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210423708.2A CN114694081A (en) 2022-04-21 2022-04-21 Video sample generation method based on multivariate attribute synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210423708.2A CN114694081A (en) 2022-04-21 2022-04-21 Video sample generation method based on multivariate attribute synthesis

Publications (1)

Publication Number Publication Date
CN114694081A true CN114694081A (en) 2022-07-01

Family

ID=82144208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210423708.2A Pending CN114694081A (en) 2022-04-21 2022-04-21 Video sample generation method based on multivariate attribute synthesis

Country Status (1)

Country Link
CN (1) CN114694081A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499396A (en) * 2022-11-16 2022-12-20 北京红棉小冰科技有限公司 Information generation method and device with personality characteristics
CN115499396B (en) * 2022-11-16 2023-04-07 北京红棉小冰科技有限公司 Information generation method and device with personality characteristics
CN116843862A (en) * 2023-08-29 2023-10-03 武汉必盈生物科技有限公司 Three-dimensional thin-wall model grid surface texture synthesis method
CN116843862B (en) * 2023-08-29 2023-11-24 武汉必盈生物科技有限公司 Three-dimensional thin-wall model grid surface texture synthesis method

Similar Documents

Publication Publication Date Title
Lee et al. Context-aware synthesis and placement of object instances
CN109389671B (en) Single-image three-dimensional reconstruction method based on multi-stage neural network
CN110390638B (en) High-resolution three-dimensional voxel model reconstruction method
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN111368662B (en) Method, device, storage medium and equipment for editing attribute of face image
CN114694081A (en) Video sample generation method based on multivariate attribute synthesis
CN116958453B (en) Three-dimensional model reconstruction method, device and medium based on nerve radiation field
Huang et al. Ponder: Point cloud pre-training via neural rendering
Yun et al. Joint face super-resolution and deblurring using generative adversarial network
Chen et al. Domain adaptation for underwater image enhancement via content and style separation
CN111462274A (en) Human body image synthesis method and system based on SMPL model
CN112634438A (en) Single-frame depth image three-dimensional model reconstruction method and device based on countermeasure network
Lei et al. NITES: A non-parametric interpretable texture synthesis method
RU2713695C1 (en) Textured neural avatars
Kouzani et al. Towards invariant face recognition
Zhang et al. DIMNet: Dense implicit function network for 3D human body reconstruction
CN110322548B (en) Three-dimensional grid model generation method based on geometric image parameterization
CN117173445A (en) Hypergraph convolution network and contrast learning multi-view three-dimensional object classification method
CN111311732A (en) 3D human body grid obtaining method and device
CN116452715A (en) Dynamic human hand rendering method, device and storage medium
CN113129347B (en) Self-supervision single-view three-dimensional hairline model reconstruction method and system
Laradji et al. SSR: Semi-supervised Soft Rasterizer for single-view 2D to 3D Reconstruction
De Souza et al. Fundamentals and challenges of generative adversarial networks for image-based applications
Shangguan et al. 3D human pose dataset augmentation using generative adversarial network
Johnston et al. Single View 3D Point Cloud Reconstruction using Novel View Synthesis and Self-Supervised Depth Estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination