Summary of the invention
It is a general object of the present disclosure to provide a method for generating a video, together with a corresponding apparatus, storage medium, and electronic device, so as to solve problems existing in the prior art.
To achieve the above object, according to a first aspect of the embodiments of the present disclosure, a method for generating a video is provided, the method including:
inputting an RGB (three-primary-color) image of a source view into a depth-and-semantics network, to obtain a depth map and a semantic map output by the depth-and-semantics network;
inputting the semantic map and the RGB image into a feature encoder network, to obtain a feature map output by the feature encoder network;
for each pose transformation matrix in a plurality of continuous pose transformation matrices of the source view, transforming the semantic map and the feature map respectively according to the pose transformation matrix and the depth map, to obtain a target semantic map and a target feature map corresponding to each pose transformation matrix, the plurality of continuous pose transformation matrices being the respective pose transformation matrices of the source view relative to a plurality of continuous image frames;
generating image frames respectively according to the target semantic map and the target feature map corresponding to each pose transformation matrix, to obtain the plurality of continuous image frames, wherein each image frame and the source view are images of a same target from different perspectives;
synthesizing the plurality of continuous image frames into a video.
Optionally, the transforming, for each pose transformation matrix in the plurality of continuous pose transformation matrices of the source view, the semantic map and the feature map respectively according to the pose transformation matrix and the depth map includes:
for each pixel in the feature map and the semantic map respectively, calculating the coordinate of the pixel in a first image frame by the following formulas:
[p_t] = d K [R|t] K^(-1) [p_s]
[R|t] = [R_s|t_s]^(-1) [R_t|t_t]
where d represents the depth value at the pixel in the depth map, K represents the camera intrinsic matrix, [R|t] represents the pose transformation matrix of the source view relative to the first image frame, R represents rotation, t represents translation, [R_s|t_s] and [R_t|t_t] respectively represent the poses of the camera in the world coordinate system under the source view and the first image frame, p_s represents the coordinate of the pixel in the source view, and p_t represents the coordinate in the first image frame.
Optionally, the generating image frames respectively according to the target semantic map and the target feature map corresponding to each pose transformation matrix includes:
performing optimization processing on the target semantic map and the target feature map corresponding to each pose transformation matrix, the optimization processing including hole filling and distortion correction;
generating the image frames respectively according to the optimized target semantic map and the optimized target feature map corresponding to each pose transformation matrix.
Optionally, the generating image frames respectively according to the target semantic map and the target feature map corresponding to each pose transformation matrix, to obtain the plurality of continuous image frames, includes:
for the target semantic map and the target feature map corresponding to each pose transformation matrix, inputting the target semantic map and the target feature map into a generator network of a generative adversarial network, to obtain the image frame corresponding to the pose transformation matrix.
Optionally, the loss function of the generative adversarial network is:
L = L_I(G, D_I) + L_V(G, D_V) + λ_F L_FM(G, D) + λ_W L_W(G)
where λ_F is a hyperparameter used to control the importance of the feature matching loss L_FM, λ_W is a hyperparameter, and L_I(G, D_I) represents the image loss of the image discriminator network; the loss function of the image discriminator network is:
L_I(G, D_I) = Σ_{k=1,2} L_GAN(G, D_I^k)
L_V(G, D_V) represents the loss of the video discriminator network; the loss function of the video discriminator network is:
L_V(G, D_V) = Σ_{k=1,2} L_GAN(G, D_V^k)
Also, the function of the feature matching loss is:
L_FM(G, D_k) = E_{(s,x)} Σ_{i=1}^{n} (1/N_i) || D_k^(i)(s, x) − D_k^(i)(s, G(s)) ||_1
where G represents the generator network, D represents the discriminator network, D_k represents a multi-scale discriminator network, k represents the index of the multi-scale discriminator network, D_1 and D_2 respectively represent multi-scale discriminator networks at two different scales, D_I^k represents the k-th multi-scale discriminator network in the image discriminator network, D_V^k represents the k-th multi-scale discriminator network in the video discriminator network, s represents the source view, x represents the target view, n represents the number of perceptron layers, N_i represents the number of elements in each layer, D_k^(i) represents the i-th-layer feature extractor of the corresponding multi-scale discriminator network D_k, || · ||_1 represents the 1-norm, and GAN represents the generative adversarial network;
L_W(G) represents the optical flow loss, and the function of the optical flow loss is:
L_W = (1/(T−1)) Σ_{t=1}^{T−1} ( || w_t − ŵ_t ||_1 + || x_{t+1} − w̃_t(x_t) ||_1 )
where T represents the number of images in the image sequence, w_t and ŵ_t respectively represent the true optical flow and the predicted optical flow between frame t and frame t+1 in the image sequence, x_{t+1} represents the image of frame t+1, and w̃_t(x_t) represents the image obtained by warping the image of frame t to frame t+1 in combination with the optical flow information;
the training of the generative adversarial network is alternating training in which the loss function is maximized and minimized by the following formula:
min_G ( max_{D_I, D_V} L_I(G, D_I) + L_V(G, D_V) ) + λ_F L_FM(G, D) + λ_W L_W(G)
where L_GAN(G, D) is expressed as:
L_GAN(G, D) = E_{(s,x)}[log D(s, x)] + E_s[log(1 − D(s, G(s)))]
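For illustration only (not part of the claimed method), the optical flow loss described above, which penalizes both the flow estimate and the flow-warped image, can be sketched in plain numpy; the function and variable names here are illustrative:

```python
import numpy as np

def flow_loss(true_flows, pred_flows, frames, warped_frames):
    """Optical flow loss: 1/(T-1) * sum_t ( ||w_t - w_hat_t||_1
    + ||x_{t+1} - w~_t(x_t)||_1 ), summed over the T-1 frame pairs."""
    T = len(frames)
    total = 0.0
    for t in range(T - 1):
        flow_term = np.abs(true_flows[t] - pred_flows[t]).sum()     # ||w_t - w_hat_t||_1
        warp_term = np.abs(frames[t + 1] - warped_frames[t]).sum()  # ||x_{t+1} - w~_t(x_t)||_1
        total += flow_term + warp_term
    return total / (T - 1)

# Two identical frames with a perfectly predicted (zero) flow give zero loss.
frames = [np.zeros((2, 2)), np.zeros((2, 2))]
zero_flow = [np.zeros((2, 2, 2))]
loss = flow_loss(zero_flow, zero_flow, frames, warped_frames=[frames[0]])
print(loss)  # -> 0.0
```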
According to a second aspect of the embodiments of the present disclosure, an apparatus for generating a video is provided, the apparatus including:
a first obtaining module, configured to input an RGB image of a source view into a depth-and-semantics network, to obtain a depth map and a semantic map output by the depth-and-semantics network;
a second obtaining module, configured to input the semantic map and the RGB image into a feature encoder network, to obtain a feature map output by the feature encoder network;
a transformation module, configured to, for each pose transformation matrix in a plurality of continuous pose transformation matrices of the source view, transform the semantic map and the feature map respectively according to the pose transformation matrix and the depth map, to obtain a target semantic map and a target feature map corresponding to each pose transformation matrix, the plurality of continuous pose transformation matrices being the respective pose transformation matrices of the source view relative to a plurality of continuous image frames;
a generation module, configured to generate image frames respectively according to the target semantic map and the target feature map corresponding to each pose transformation matrix, to obtain the plurality of continuous image frames, wherein each image frame and the source view are images of a same target from different perspectives;
a synthesis module, configured to synthesize the plurality of continuous image frames into a video.
Optionally, the transformation module includes:
a calculation submodule, configured to, for each pixel in the feature map and the semantic map respectively, calculate the coordinate of the pixel in a first image frame by the following formulas:
[p_t] = d K [R|t] K^(-1) [p_s]
[R|t] = [R_s|t_s]^(-1) [R_t|t_t]
where d represents the depth value at the pixel in the depth map, K represents the camera intrinsic matrix, [R|t] represents the pose transformation matrix of the source view relative to the first image frame, R represents rotation, t represents translation, [R_s|t_s] and [R_t|t_t] respectively represent the poses of the camera in the world coordinate system under the source view and the first image frame, p_s represents the coordinate of the pixel in the source view, and p_t represents the coordinate in the first image frame.
Optionally, the generation module includes:
an optimization submodule, configured to perform optimization processing on the target semantic map and the target feature map corresponding to each pose transformation matrix, the optimization processing including hole filling and distortion correction;
a first generation submodule, configured to generate the image frames respectively according to the optimized target semantic map and the optimized target feature map corresponding to each pose transformation matrix.
Optionally, the generation module further includes:
a second generation submodule, configured to, for the target semantic map and the target feature map corresponding to each pose transformation matrix, input the target semantic map and the target feature map into a generator network of a generative adversarial network, to obtain the image frame corresponding to the pose transformation matrix.
Optionally, the loss function of the generative adversarial network is:
L = L_I(G, D_I) + L_V(G, D_V) + λ_F L_FM(G, D) + λ_W L_W(G)
where λ_F is a hyperparameter used to control the importance of the feature matching loss L_FM, λ_W is a hyperparameter, and L_I(G, D_I) represents the image loss of the image discriminator network; the loss function of the image discriminator network is:
L_I(G, D_I) = Σ_{k=1,2} L_GAN(G, D_I^k)
L_V(G, D_V) represents the loss of the video discriminator network; the loss function of the video discriminator network is:
L_V(G, D_V) = Σ_{k=1,2} L_GAN(G, D_V^k)
Also, the function of the feature matching loss is:
L_FM(G, D_k) = E_{(s,x)} Σ_{i=1}^{n} (1/N_i) || D_k^(i)(s, x) − D_k^(i)(s, G(s)) ||_1
where G represents the generator network, D represents the discriminator network, D_k represents a multi-scale discriminator network, k represents the index of the multi-scale discriminator network, D_1 and D_2 respectively represent multi-scale discriminator networks at two different scales, D_I^k represents the k-th multi-scale discriminator network in the image discriminator network, D_V^k represents the k-th multi-scale discriminator network in the video discriminator network, s represents the source view, x represents the target view, n represents the number of perceptron layers, N_i represents the number of elements in each layer, D_k^(i) represents the i-th-layer feature extractor of the corresponding multi-scale discriminator network D_k, || · ||_1 represents the 1-norm, and GAN represents the generative adversarial network;
L_W(G) represents the optical flow loss, and the function of the optical flow loss is:
L_W = (1/(T−1)) Σ_{t=1}^{T−1} ( || w_t − ŵ_t ||_1 + || x_{t+1} − w̃_t(x_t) ||_1 )
where T represents the number of images in the image sequence, w_t and ŵ_t respectively represent the true optical flow and the predicted optical flow between frame t and frame t+1 in the image sequence, x_{t+1} represents the image of frame t+1, and w̃_t(x_t) represents the image obtained by warping the image of frame t to frame t+1 in combination with the optical flow information;
the training of the generative adversarial network is alternating training in which the loss function is maximized and minimized by the following formula:
min_G ( max_{D_I, D_V} L_I(G, D_I) + L_V(G, D_V) ) + λ_F L_FM(G, D) + λ_W L_W(G)
where L_GAN(G, D) is expressed as:
L_GAN(G, D) = E_{(s,x)}[log D(s, x)] + E_s[log(1 − D(s, G(s)))]
According to a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, where the program instructions, when executed by a processor, implement the steps of the method described in the first aspect of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, an electronic device is provided, including:
a memory on which a computer program is stored; and
a processor configured to execute the computer program in the memory, to implement the steps of the method described in the first aspect of the present disclosure.
By adopting the above technical solution, geometric transformation is performed on the semantic map and the feature map of the source view according to the continuous pose sequence and the depth map corresponding to the source view, so that a plurality of continuous target semantic maps and a plurality of continuous target feature maps can be obtained respectively; the plurality of target semantic maps and their corresponding target feature maps are then synthesized into a plurality of continuous image frames, and these continuous image frames are synthesized into a video. In this way, the three-dimensional structure of invisible regions can be inferred from the depth map, the semantic map, and the feature map of the source view while keeping their true texture, so that the generated image frames are clearer and more lifelike, and the video generated in this way is more realistic and more stable.
Other features and advantages of the present disclosure will be described in detail in the following detailed description section.
Detailed description of the embodiments
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely used to describe and explain the present disclosure, and are not intended to limit the present disclosure.
Example embodiments are described in detail here, and examples thereof are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following example embodiments do not represent all implementations consistent with the present disclosure; on the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above accompanying drawings of the present disclosure are used to distinguish similar objects, and are not to be construed as indicating a specific order or precedence.
To make the technical solutions provided by the embodiments of the present disclosure easier for those skilled in the art to understand, related concepts involved in the present disclosure are briefly introduced first.
Computer vision refers to the simulation of biological vision using computers and related devices. Its main task is to process acquired pictures or videos to obtain three-dimensional information of the corresponding scene. More specifically, it refers to machine vision in which cameras and computers replace human eyes to identify, track, and measure targets, with further graphics processing performed so that the computer produces images better suited for human observation or for transmission to instruments for detection.
Viewing angle refers to the angle formed between the line of sight and the vertical direction of a display or the like. Specifically, when observing an object, it is the angle formed at the optical center of the human eye by the light rays drawn from the two ends (upper and lower, or left and right) of the object.
Multi-view images refer to images of the same three-dimensional scene mapped from different perspectives.
A three-primary-color image refers to an RGB (Red Green Blue) image, that is, an image composed of the three color channels red, green, and blue.
A depth map, or depth image, also referred to as a range image, refers to an image in which the distance (depth) from the image acquisition device to each point in the scene is taken as the pixel value. It directly reflects the geometry of the visible surfaces of objects in the scene; in other words, a depth image is a three-dimensional representation of an object.
A semantic map refers to the result of a machine automatically segmenting and identifying the content in an image. Specifically, different objects in the image are segmented at the pixel level, the object class each represents is labeled, and the position of the object in the image is located and detected.
A feature map: in each convolutional layer of a convolutional neural network, data exist in three dimensions, which can be regarded as a stack of many two-dimensional pictures, each of which is called a feature map. That is, in each layer of a convolutional neural network the image is described from multiple angles; specifically, convolution operations are performed on the image with different convolution kernels to obtain the responses on different kernels (a kernel can be understood as a descriptor above) as the features of the image. In other words, a feature map is the result produced by convolving the image with a kernel.
Pose refers to the position and attitude of an object in an image in a specified coordinate system, describing the relative position and relative motion trajectory of the object. Images under different viewing angles have different poses.
A hole refers to a region in a processed image whose pixels have no value or have extreme values, for example the set of pixels inside a closed circle formed by eight-connected points within a binary image.
Bilinear interpolation, also called the bilinear interpolation method, uses the pixel values of four neighboring points, assigns them different weights according to their distances from the interpolation point, and performs linear interpolation. This method has the effect of an averaging low-pass filter, smooths edges, and produces a relatively coherent output image.
Resampling refers to the process of interpolating one kind of pixel information from another kind of pixel information; it is an image data processing method, namely the gray-scale processing method used during image data reorganization. Image sampling acquires image gray values at certain intervals; when a required value is not located at a sampled point of the original function, interpolation at the sampled points is necessary, which is called resampling.
Pre-training refers to the process of training a model in advance, or to a model trained in advance.
Robustness refers to the characteristic of a control system to maintain certain other performances under parameter perturbations of a certain kind (in structure or magnitude).
Stability refers to the ability of a control system to return to its original equilibrium state after the perturbation that caused it to deviate from that equilibrium state disappears.
Spatio-temporal consistency refers to the property of being consistent over both time and space.
Optical flow prediction is a method of finding the correspondence between a previous frame and the current frame by using the temporal variation of pixels in an image sequence and the correlation between adjacent frames, so as to calculate the motion information of objects between adjacent frames.
An embodiment of the present disclosure provides a method for generating a video. As shown in Figure 1, the method includes:
S101: inputting an RGB image of a source view into a depth-and-semantics network, to obtain a depth map and a semantic map output by the depth-and-semantics network.
The source view is processed using a pre-trained depth-and-semantics network to obtain the semantic map and the depth map corresponding to the source view. Specifically, semantic segmentation and depth prediction are performed on the source view using pre-trained semantic segmentation and depth prediction networks, where the semantic segmentation and depth prediction networks may be deep neural networks, for example convolutional neural networks.
Illustratively, the source view is input into the depth-and-semantics network, and a convolution operation is performed on the image by a 3 × 3 convolution kernel; the convolutional layer then outputs new two-dimensional image data. It is worth noting that processing the image with different convolution kernels of the same size can extract different features of the image, for example contour, color, texture, and so on. The new two-dimensional information is then input into the next convolutional layer for processing. After the convolutional layers, the data are input into a fully connected layer, whose output is a one-dimensional vector. Since this one-dimensional vector represents the probability that an object in the image is each object in the network's object classification, after the semantic segmentation processing of this network, which object each item in the image represents can be known from the one-dimensional vector output by the fully connected layer, and the above-mentioned semantic map can thus be obtained. For example, if the input image contains a person riding a motorcycle, then after semantic segmentation the person and the vehicle can be separated, the region of the image where the person is located is labeled as person, and the region where the motorcycle is located is labeled as motorcycle. For another example, if the input image contains two people, one of whom is riding a motorcycle, after semantic segmentation the regions of the two people in the image are labeled as person, and the motorcycle region is labeled as motorcycle. In a possible implementation, the person riding the motorcycle may also be labeled as person 1, and the other person labeled as person 2.
It is worth noting that the source view described in step S101 may be any image frame in a video captured by a camera, or may be a single image captured by a camera.
In addition, by inputting the source view into the depth-and-semantics network, the depth map corresponding to the image can be obtained. The depth image reflects the depth information of the scene in the image. By obtaining the depth value of each pixel, the distance from each point in the scene to the camera plane can be known; therefore, the depth map can directly reflect the geometry of the visible surfaces of objects in the scene. Moreover, since image pixels are dense, the three-dimensional information of objects in invisible regions can be inferred from the dense depth map information.
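For illustration only (not part of the claimed method), the 3 × 3 convolution step described in this example can be sketched as a bare numpy operation, showing how one kernel yields one two-dimensional response map; all names here are illustrative:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most CNN
    frameworks) of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A 3x3 kernel applied to a 4x4 image yields a 2x2 response map.
img = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.zeros((3, 3)); kernel[1, 1] = 1.0  # identity kernel: picks each window's center
resp = conv2d(img, kernel)
print(resp)  # centers of the four 3x3 windows: [[5. 6.], [9. 10.]]
```

Different kernels of the same size (edge filters, blur filters, and so on) would produce different response maps from the same image, which is the sense in which each kernel extracts a different feature.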
S102: inputting the semantic map and the RGB image into a feature encoder network, to obtain a feature map output by the feature encoder network.
In order to make the generated target image frames and the source view spatio-temporally continuous, in other words, in order to make the generated images maintain the original features of all objects in the source view, such as shape features, color features, texture features, and spatial relationship features, a feature encoder network may be used to extract the features in the source view. Color features represent the surface properties of the objects corresponding to the image or image region. Shape features fall into two kinds, contour features and region features; the contour features of an image mainly concern the outer boundary of an object, while its region features relate to the entire shape region. Spatial relationship features refer to the mutual spatial positions or relative direction relationships among the multiple objects segmented from the image; these relationships can further be divided into connection or adjacency relationships, overlapping relationships, containment relationships, and so on.
It is worth noting that the features extracted by the feature encoder may be low-dimensional vectors or high-dimensional vectors; in other words, these features may be low-level features or high-level features, which the present disclosure does not limit. Specifically, through the feature encoder, low-level image features, that is, edge information represented by low-dimensional vectors, are obtained; feature combination is then performed to obtain higher-level image features, that is, high-level feature information represented by high-dimensional vectors. Through feature extraction, the above feature map can be made to keep the real features of the source image.
Therefore, in step S102, by inputting the semantic map and the RGB image of the source view into the feature encoder network, the feature map of the RGB image can be obtained; this feature map maintains the original feature information of each instance in the semantic map, where an instance refers to an independent individual, for example person 1 and person 2 described above may respectively be two instances.
S103: for each pose transformation matrix in a plurality of continuous pose transformation matrices of the source view, transforming the semantic map and the feature map respectively according to the pose transformation matrix and the depth map, to obtain a target semantic map and a target feature map corresponding to each pose transformation matrix.
The plurality of continuous pose transformation matrices are the respective pose transformation matrices of the source view relative to a plurality of continuous image frames; in other words, the plurality of continuous pose transformation matrices are the pose transformation matrices of a plurality of target views relative to the source view, and the continuous pose transformation matrices may be input by a user. According to each pose transformation matrix in the plurality of continuous pose transformation matrices and the depth map of the source view, the semantic map and the feature map of the source view are transformed respectively, to obtain the target semantic map and the target feature map corresponding to each pose transformation matrix. Moreover, the multiple target semantic maps and multiple target feature maps obtained according to the continuous pose transformation matrices are also continuous.
Specifically, through the plurality of continuous pose transformation matrices, multiple continuous images of the same three-dimensional scene from different perspectives can be obtained; that is, multiple target images can be obtained through the pose transformation matrices. In a possible implementation, for example when the poses are unknown, Visual Odometry (VO) or Direct Sparse Odometry (DSO) or the like can be used to process the image sequence, to obtain the pose data corresponding to each image, [R|t] = [R|t]_1, [R|t]_2, …, [R|t]_n, where [R|t]_1 represents the pose of the first image and [R|t]_n represents the pose of the n-th image.
Optionally, the transforming, for each pose transformation matrix in the plurality of continuous pose transformation matrices of the source view, the semantic map and the feature map respectively according to the pose transformation matrix and the depth map may include the following steps:
for each pixel in the feature map and the semantic map respectively, calculating the coordinate of the pixel in a first image frame by the following formulas:
[p_t] = d K [R|t] K^(-1) [p_s]
[R|t] = [R_s|t_s]^(-1) [R_t|t_t]
where d represents the depth value at the pixel in the depth map, K represents the camera intrinsic matrix, [R|t] represents the pose transformation matrix of the source view relative to the first image frame, R represents rotation, t represents translation, [R_s|t_s] and [R_t|t_t] respectively represent the poses of the camera in the world coordinate system under the source view and the first image frame, p_s represents the coordinate of the pixel in the source view, and p_t represents the coordinate in the first image frame.
Using the above calculation method, each pixel of the feature map and the semantic map of the source view can be mapped into the first image frame through the pose transformation matrix, where the first image frame may be any target image among the above multiple target feature maps and multiple target semantic maps. Therefore, when the pose sequence of the images is known, the image under any other pose can be obtained from the image corresponding to any one pose.
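For illustration only (not part of the claimed method), the per-pixel calculation above can be sketched in numpy, assuming a known camera intrinsic matrix K, a per-pixel depth d, and a relative pose [R|t]; all names and values here are illustrative:

```python
import numpy as np

def warp_pixel(p_s, d, K, R, t):
    """Map a source-view pixel into the target view.

    p_s:  homogeneous pixel coordinate (u, v, 1) in the source view
    d:    depth of the pixel in the source view
    K:    3x3 camera intrinsic matrix
    R, t: rotation (3x3) and translation (3,) of the relative pose [R|t]
    """
    ray = np.linalg.inv(K) @ p_s   # back-project to a normalized camera ray
    X = d * ray                    # 3D point in the source camera frame
    X_t = R @ X + t                # transform into the target camera frame
    p = K @ X_t                    # project into the target image plane
    return p[:2] / p[2]            # dehomogenize to (u, v)

# The identity pose leaves a pixel where it is, regardless of its depth.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
p_t = warp_pixel(np.array([100.0, 80.0, 1.0]), d=2.0, K=K, R=np.eye(3), t=np.zeros(3))
print(np.round(p_t, 6))  # -> [100.  80.]
```

A non-zero translation t would shift the projected coordinate by an amount that shrinks as the depth d grows, which is the expected parallax behavior.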
In a possible implementation, the change between two adjacent poses may also be divided into N equal parts as needed; for example, the pose change between pose [R|t]_1 and pose [R|t]_2 is divided into N equal parts, to obtain N−1 new pose data. Any one of the pose data after division is then chosen as the pose of the target view. Then, by the above calculation method, the semantic map and feature map of any target view are calculated from the semantic map and feature map of the source view. This method can be used to insert more image frames between two adjacent image frames.
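For illustration only (not part of the claimed method), a simple way to divide a pose change into N equal parts is to interpolate the translation linearly and the rotation linearly in angle; this sketch assumes the rotation is about a single known axis (here z), whereas a general implementation would interpolate rotations with quaternion SLERP:

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def split_pose(t1, theta1, t2, theta2, N):
    """Divide the pose change between (rot_z(theta1), t1) and
    (rot_z(theta2), t2) into N equal parts, returning the N-1
    intermediate poses as (R, t) pairs."""
    poses = []
    for k in range(1, N):
        a = k / N
        t = (1 - a) * np.asarray(t1, float) + a * np.asarray(t2, float)  # linear in translation
        R = rot_z((1 - a) * theta1 + a * theta2)                         # linear in angle
        poses.append((R, t))
    return poses

# Halving the change between the origin pose and a pose translated by 1 unit
# and rotated by 90 degrees gives the midpoint pose.
mid = split_pose(t1=[0, 0, 0], theta1=0.0, t2=[1.0, 0, 0], theta2=np.pi / 2, N=2)
R_mid, t_mid = mid[0]
print(t_mid)  # -> [0.5 0.  0. ]
```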
In addition, since the coordinates p_t obtained through the pose transformation matrix are not integers, the values in the four adjacent regions may be resampled using the bilinear interpolation method, so that the transformed image is smoother.
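For illustration only (not part of the claimed method), resampling a warped, non-integer coordinate with bilinear interpolation can be sketched as follows for a single-channel image:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample img at a non-integer (x, y) using the 4 nearest pixels,
    weighting each by its distance to the sample point."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] +
            wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] +
            wx * wy * img[y1, x1])

img = np.array([[0.0, 10.0], [20.0, 30.0]])
v = bilinear_sample(img, 0.5, 0.5)  # center of the 2x2 patch
print(v)  # -> 15.0 (the average of the four corners)
```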
S104: generating image frames respectively according to the target semantic map and the target feature map corresponding to each pose transformation matrix, to obtain a plurality of continuous image frames.
Each image frame and the source view are images of the same target from different perspectives; in other words, each image frame and the source view are images of the same three-dimensional scene under different poses.
In step S104, an image frame may be generated according to the target semantic map and the target feature map corresponding to one source view, or may be generated according to the target semantic maps and target feature maps corresponding to multiple source views. For example, as described in the example of step S103 above, the above calculation method can be used to insert more image frames between two adjacent image frames. That is to say, when two image frames are known and more images are to be inserted between them, the images under one same pose can also be obtained respectively from the two known image frames; the two images under the same pose are then synthesized into the image under that pose, that is, the target image frame. In this way, from the images of the two poses, more feature information of the image under the same pose can be obtained, so the obtained image under the target pose can be made more real, and the generated image frame more lifelike.
S105: synthesizing the plurality of continuous image frames into a video.
With the above method, the semantic map and feature map of the source view are geometrically transformed according to the continuous pose sequence and the depth map corresponding to the source view, yielding multiple continuous target semantic maps and multiple continuous target feature maps. Each target semantic map and its corresponding target feature map are then synthesized into an image frame, producing multiple continuous image frames, and these continuous image frames are in turn synthesized into a video. In this way, using the depth map, semantic map and feature map of the source view, the three-dimensional structure of invisible regions can be inferred while their true texture is preserved, making the generated image frames clearer and more lifelike; accordingly, the video synthesized from these image frames is also more realistic, and the stability of the video is improved. In addition, since this method generates multiple continuous image frames from the image of a source view, it can also be used to insert more image frames between two continuous image frames, for example inserting more image frames between a preceding frame and a following frame of a video. This makes the video contain more image frames, raises the frame rate of the video, and indirectly raises the effective frame rate of the camera, thereby improving the continuity of the video and making it smoother.
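The overall flow described above — estimate depth and semantics from the source RGB image, encode features, warp both maps once per pose transformation, generate one frame per pose, and stack the frames into a video — can be sketched end to end. Every network below is a hypothetical stub with illustrative shapes; only the data flow follows the text:

```python
import numpy as np

# End-to-end sketch of the pipeline; each function is a stand-in for a
# network named in the disclosure (depth-and-semantic network, feature
# encoder, geometric warp, generator). Shapes and "poses" are illustrative.
def estimate_depth_and_semantics(rgb):
    # Stand-in for the depth and semantic network.
    return np.ones(rgb.shape[:2]), np.zeros(rgb.shape[:2], dtype=int)

def encode_features(rgb, semantics):
    # Stand-in for the feature encoder network.
    return rgb.mean(axis=2)

def warp_by_pose(tensor, depth, pose):
    # Placeholder for the depth-driven geometric warp; the toy "pose"
    # here is just a horizontal shift in pixels.
    return np.roll(tensor, int(pose), axis=1)

def generate_frame(sem_t, feat_t):
    # Stand-in for the frame generator.
    return feat_t + sem_t

rgb = np.random.rand(8, 8, 3)
depth, sem = estimate_depth_and_semantics(rgb)
feat = encode_features(rgb, sem)
poses = [1, 2, 3]  # one pose transformation per target frame
video = [generate_frame(warp_by_pose(sem, depth, p),
                        warp_by_pose(feat, depth, p)) for p in poses]
print(len(video), video[0].shape)  # three frames, one per pose
```

In a real system the stubs would be the trained networks, and the warp would be the pixel reprojection driven by the depth map and pose transformation matrix described in the embodiments.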
Optionally, generating the image frames respectively according to the target semantic map and target feature map corresponding to each pose transformation matrix may include the following steps:
performing optimization processing according to the target semantic map and target feature map corresponding to each pose transformation matrix, the optimization processing including hole filling and distortion correction;
generating the image frames respectively according to the optimized target semantic map and the optimized target feature map corresponding to each pose transformation matrix.
In the process of transforming to obtain the target semantic map and target feature map through the pose transformation matrix, invisible regions exist in the image, i.e., regions occluded by foreground objects under the current viewing angle. A region that is invisible under the viewing angle of the source view may be visible under the viewing angle of the target view, so the transformed target semantic map and target feature map may have missing pixels, i.e., holes. To solve this problem, in the method of the present disclosure, hole filling may be performed on the target semantic map and target feature map through an optimization network. Moreover, when calculating the coordinates of each pixel in the target semantic map and target feature map, errors may occur, causing the pixel coordinates to deviate and the scenery in the image to be distorted. Therefore, in the method of the present disclosure, the distorted image may also be corrected through the optimization processing. Specifically, the image may be optimized using the optimization network, where the loss function of the optimization network is:

    L_total = L_1 + λ · L_perc

where L_total represents the overall loss of the optimization network, L_1 represents the pixel-wise L1 loss, L_perc represents the perceptual loss, and λ represents a hyperparameter.
For L_perc, a very deep convolutional network for large-scale image recognition (Very Deep Convolutional Networks for Large-Scale Image Recognition, VGG network) can be used to extract the features of the generated image and of the real image respectively, and the L1 loss between the two, i.e., the mean absolute error, is calculated and taken as the value of L_perc.
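As a numeric illustration of this refinement objective — a pixel L1 term plus a λ-weighted perceptual term computed on extracted features — the sketch below uses a toy `phi` as a stand-in for the VGG feature extractor (both `phi` and the λ value are illustrative assumptions, not from the disclosure):

```python
import numpy as np

def l1_loss(a, b):
    # Mean absolute error between two arrays.
    return np.mean(np.abs(a - b))

def phi(img):
    # Hypothetical stand-in for a frozen VGG feature map: per-pixel
    # channel mean and channel standard deviation.
    return np.stack([img.mean(axis=0), img.std(axis=0)])

def refinement_loss(generated, target, lam=0.1):
    pixel = l1_loss(generated, target)                  # L_1 term
    perceptual = l1_loss(phi(generated), phi(target))   # L_perc term
    return pixel + lam * perceptual                     # L_total

gen = np.zeros((3, 4, 4))  # generated image, channels first
tgt = np.ones((3, 4, 4))   # real image
loss = refinement_loss(gen, tgt)
print(loss)  # 1.0 + 0.1 * 0.5 = 1.05
```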
In this way, by performing hole filling and distortion correction on the target semantic map and target feature map, the optimized image can be made more lifelike.
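For intuition only, hole filling can be mimicked with a classical diffusion-style pass in which each missing pixel takes the mean of its valid neighbours; the actual optimization network in this disclosure is learned, so the function below is merely an illustrative stand-in:

```python
import numpy as np

def fill_holes(feature_map, valid_mask, iters=10):
    # Invalid pixels repeatedly take the mean of their valid 4-neighbours,
    # so values diffuse inward from the hole boundary.
    fm = feature_map.astype(float).copy()
    mask = valid_mask.astype(bool).copy()
    for _ in range(iters):
        new_mask = mask.copy()
        for i in range(fm.shape[0]):
            for j in range(fm.shape[1]):
                if mask[i, j]:
                    continue
                vals = [fm[a, b] for a, b in
                        ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= a < fm.shape[0] and 0 <= b < fm.shape[1]
                        and mask[a, b]]
                if vals:
                    fm[i, j] = sum(vals) / len(vals)
                    new_mask[i, j] = True
        mask = new_mask
        if mask.all():
            break
    return fm

fm = np.array([[1., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
mask = fm > 0  # the centre pixel is a hole left by the warp
filled = fill_holes(fm, mask)
print(filled[1, 1])  # hole filled with the neighbour mean
```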
In addition, it should also be noted that multiple hypotheses may be used to optimize the target image frame. "Multiple hypotheses" refers to performing pose transformations on source views under multiple different poses, thereby inferring multiple target views under the same pose. The reason is that the information of the same three-dimensional scene seen under different viewing angles, that is, under different poses, differs: when the same three-dimensional scene is viewed from different angles, the invisible regions caused by foreground occlusion differ. Therefore, multiple hypotheses about the target image can be formed from the image frames before and after the target image frame. In this way, the information of images under different poses can be integrated, and the information of the invisible regions in the source view can be inferred more accurately, so that the generated target image sequence is more lifelike, and the generated video is more lifelike and smooth.
Optionally, generating the image frames respectively according to the target semantic map and target feature map corresponding to each pose transformation matrix to obtain multiple continuous image frames may include the following step:
for the target semantic map and target feature map corresponding to each pose transformation matrix, inputting the target semantic map and target feature map into the generator network of a generative adversarial network to obtain the image frame corresponding to the pose transformation matrix.
Those skilled in the art should appreciate that a generative adversarial network can synthesize high-resolution and realistically textured images. Specifically, a generative adversarial network includes a generator network and a discriminator network. The goal of a generative adversarial network is to generate samples that pass for real, that is, the samples generated by the generator network are so realistic that the discriminator network lacks the capacity to distinguish real samples from generated fake samples. In the present disclosure, the samples are images.
Optionally, the loss function of the generative adversarial network is:

    L = L_GAN(G, D_I) + L_GAN(G, D_V) + λ_F · L_FM(G, D) + λ_W · L_W

where λ_F is a hyperparameter used to control the importance of the feature matching loss L_FM, and λ_W is a hyperparameter. L_GAN(G, D_I) denotes the image loss of the image discriminator network, the loss function of the image discriminator network being:

    L_GAN(G, D_I) = Σ_k L_GAN(G, D_k^I)

L_GAN(G, D_V) denotes the image loss of the video discriminator network, the loss function of the video discriminator network being:

    L_GAN(G, D_V) = Σ_k L_GAN(G, D_k^V)

Also, the function of the feature matching loss is:

    L_FM(G, D_k) = E_(s,x) Σ_{i=1..n} (1/N_i) · ‖ D_k^(i)(s, x) − D_k^(i)(s, G(s)) ‖_1

where G represents the generator network, D represents the discriminator network, D_k represents a multi-scale discriminator network, k represents the index of the multi-scale discriminator network, D_1 and D_2 respectively represent multi-scale discriminator networks at two different scales, D_k^I represents the k-th multi-scale discriminator network in the image discriminator network, D_k^V represents the k-th multi-scale discriminator network in the video discriminator network, s represents the source view, x represents the target view, n represents the number of layers of the perceptron, N_i represents the number of elements in layer i, i indexes the layers, D_k^(i) represents the i-th-layer feature extractor of the multi-scale discriminator network D_k, ‖·‖_1 represents the L1 norm, and GAN stands for generative adversarial network;
L_W represents the optical flow loss, the function of which is:

    L_W = (1/(T−1)) · Σ_{t=1..T−1} ( ‖ w_t − ŵ_t ‖_1 + ‖ x_{t+1} − w̃_t(x_t) ‖_1 )

where T represents the number of images in the sequence, w_t and ŵ_t respectively represent the ground-truth optical flow and the predicted optical flow between frame t and frame t+1 of the image sequence, x_{t+1} represents the image of frame t+1, and w̃_t(x_t) represents the image obtained by mapping the image of frame t to frame t+1 in combination with the optical flow information;
The generative adversarial network is trained by alternately maximizing and minimizing the loss function according to the following formula:

    min_G ( ( max_{D_I} L_GAN(G, D_I) ) + ( max_{D_V} L_GAN(G, D_V) ) + λ_F · L_FM(G, D) + λ_W · L_W )

where L_GAN(G, D) is expressed as:

    L_GAN(G, D) = E_(s,x)[ log D(s, x) ] + E_s[ log(1 − D(s, G(s))) ]
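The combined objective above — two adversarial terms plus λ_F times a feature matching term and λ_W times an optical flow term — can be exercised numerically. Toy arrays stand in for discriminator feature maps and flow fields; all names and weights below are illustrative assumptions:

```python
import numpy as np

def feature_matching(real_feats, fake_feats):
    # Sum over discriminator layers of the mean absolute difference of
    # the features extracted from real and generated images.
    return sum(np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats))

def flow_loss(flow_true, flow_pred, x_next, x_warped):
    # || w_t - w_hat_t ||_1  +  || x_{t+1} - warp(x_t) ||_1
    return (np.mean(np.abs(flow_true - flow_pred))
            + np.mean(np.abs(x_next - x_warped)))

def generator_objective(gan_img, gan_vid, real_feats, fake_feats,
                        flow_true, flow_pred, x_next, x_warped,
                        lam_f=10.0, lam_w=10.0):
    # L = L_GAN(G, D_I) + L_GAN(G, D_V) + lam_f * L_FM + lam_w * L_W
    return (gan_img + gan_vid
            + lam_f * feature_matching(real_feats, fake_feats)
            + lam_w * flow_loss(flow_true, flow_pred, x_next, x_warped))

real = [np.ones((2, 2)), np.full((2, 2), 2.0)]   # toy D features, real image
fake = [np.zeros((2, 2)), np.full((2, 2), 2.0)]  # toy D features, generated
loss = generator_objective(0.5, 0.5, real, fake,
                           np.zeros((2, 2)), np.zeros((2, 2)),
                           np.ones((2, 2)), np.ones((2, 2)))
print(loss)  # 0.5 + 0.5 + 10*1.0 + 10*0.0 = 11.0
```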
It is worth noting that multi-scale discriminators are used in this scheme, specifically a multi-scale image discriminator D_I and a multi-scale video discriminator D_V. Using multi-scale discriminators helps the network converge and speeds up training, and can reduce repeated patch artifacts on the generated target image.
In addition, in order to maintain the spatio-temporal consistency of the generated image frames, the source view image is simultaneously fed into the generative adversarial network, the optical flow is predicted, and the loss between the predicted optical flow and the ground-truth optical flow is compared. Those skilled in the art should appreciate that optical flow can be learned with convolutional networks (Learning Optical Flow with Convolutional Networks, FlowNet).
In this way, the image frames generated with the generative adversarial network can preserve the real information of the three-dimensional scene, and the optical flow prediction can further enhance the spatio-temporal consistency between a generated image frame and the image frames before and after it.
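The flow-based mapping of one frame onto the next, used in the optical flow loss, can be sketched with nearest-neighbour backward warping (real systems typically use differentiable bilinear sampling of a FlowNet-predicted flow; this is a minimal illustrative version):

```python
import numpy as np

def warp(image, flow):
    # Backward warping: for each target pixel, the flow says where to
    # sample in the source frame; out-of-bounds samples are left at zero.
    h, w = image.shape
    out = np.zeros_like(image)
    for i in range(h):
        for j in range(w):
            si = int(round(i - flow[i, j, 1]))
            sj = int(round(j - flow[i, j, 0]))
            if 0 <= si < h and 0 <= sj < w:
                out[i, j] = image[si, sj]
    return out

x_t = np.arange(16.0).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0  # every pixel moved one step right between frames
x_warped = warp(x_t, flow)
print(x_warped[0])  # row [0,1,2,3] becomes [0,0,1,2]
```

The L1 difference between `x_warped` and the true next frame would be the second term of the optical flow loss.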
In conclusion, with the method described in the present disclosure, the RGB image of the source view is taken as input, and images under arbitrary poses are generated according to the depth map, feature map and semantic map of the source view and the pose transformation matrices relative to the target views. The generated images are then optimized and, in combination with the optical flow prediction, the spatial consistency between the generated images and the source view can be maintained. Using the depth map, semantic map and feature map of the source view, the three-dimensional structure of invisible regions can be inferred while their true texture is preserved, making the generated image frames clearer and more lifelike; the video generated in this way is therefore more realistic and more stable. In addition, it is worth noting that the method of the present disclosure can be applied to vSLAM mapping, VO localization, 3D reconstruction and the like, which the present disclosure does not limit. For example, if the frame rate at which the camera captures images is too low, the initialization of vSLAM may be affected, causing vSLAM mapping to be interrupted and the mapping result to be poor. As another example, VO determines the position and attitude of every frame of data captured by the camera by analyzing and processing associated image sequences; raising the frame rate of the camera helps improve the precision and stability of VO localization. As yet another example, visual 3D reconstruction mainly acquires image data of objects in a scene through a camera, analyzes and processes the images, and reconstructs three-dimensional models of the objects in combination with computer vision and graphics techniques; if the frame rate of the captured images is increased, the difference between two adjacent frames becomes very small, which helps improve the precision of the model. Therefore, according to the above method, the filling of inter-frame image data can be realized and the effective frame rate of the camera indirectly raised, thereby increasing the continuity and stability of the video and in turn improving the precision and robustness of vSLAM, VO and 3D reconstruction.
An embodiment of the present disclosure also provides a device for generating a video, for implementing the steps of the method for generating a video provided by the above method embodiments. As shown in Fig. 2, the device 200 includes:
a first obtaining module 210, configured to input the RGB image of a source view into a depth and semantic network to obtain the depth map and semantic map output by the depth and semantic network;
a second obtaining module 220, configured to input the semantic map and the RGB image into a feature encoder network to obtain the feature map output by the feature encoder network;
a transformation module 230, configured to, for each pose transformation matrix among multiple continuous pose transformation matrices of the source view, transform the semantic map and the feature map respectively according to the pose transformation matrix and the depth map, to obtain the target semantic map and target feature map corresponding to each pose transformation matrix, the multiple continuous pose transformation matrices being pose transformation matrices of the source view relative to each of multiple continuous image frames;
a generation module 240, configured to generate image frames respectively according to the target semantic map and target feature map corresponding to each pose transformation matrix, to obtain multiple continuous image frames, where each image frame and the source view are images of the same object from different viewing angles;
a synthesis module 250, configured to synthesize the multiple continuous image frames into a video.
With the above device, the semantic map and feature map of the source view are geometrically transformed according to the continuous pose sequence and the depth map corresponding to the source view, yielding multiple continuous target semantic maps and multiple continuous target feature maps; each target semantic map and its corresponding target feature map are then synthesized into an image frame, and these continuous image frames are in turn synthesized into a video. With this device, the depth map, semantic map and feature map of the source view are used to infer the three-dimensional structure of invisible regions while preserving their true texture, making the generated image frames clearer and more lifelike, so the video generated by the device is more realistic and more stable. In addition, since multiple continuous image frames are generated from the image frame of the source view, the device can also be used to insert more image frames between two continuous or discontinuous image frames of a video. This makes the generated video contain more image frames, raises the frame rate of the video, and indirectly raises the effective frame rate of the camera, thereby improving the continuity of the video and making it smoother.
Optionally, as shown in Fig. 3, the transformation module 230 further includes:
a calculation submodule 231, configured to calculate, for each pixel in the feature map and in the semantic map respectively, the coordinates of the pixel in the first image frame by the following formulas:

    p_t = K [R | t] d K⁻¹ p_s
    [R | t] = [R_s | t_s]⁻¹ [R_t | t_t]

where d represents the depth value in the depth map at the pixel, K represents the camera intrinsics, [R | t] represents the pose transformation matrix of the source view relative to the first image frame, R represents a rotation, t represents a translation, [R_s | t_s] and [R_t | t_t] respectively represent the poses of the camera in the world coordinate system under the source view and under the first image frame, p_s represents the coordinates of the pixel in the source view, and p_t represents its coordinates in the first image frame.
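One common reading of these formulas, as a back-project/transform/project chain, can be checked numerically: lift the pixel to 3D with the depth and the inverse intrinsics, apply the relative pose, and re-project with the intrinsics. The intrinsics and pose below are illustrative:

```python
import numpy as np

def reproject_pixel(p_s, depth, K, R, t):
    # Back-project to a 3D point in the source camera frame.
    X_s = depth * (np.linalg.inv(K) @ p_s)
    # Apply the relative pose [R | t] to the target camera frame.
    X_t = R @ X_s + t
    # Project with the intrinsics and normalise to pixel coordinates.
    p_t = K @ X_t
    return p_t / p_t[2]

# Identity rotation with a pure x-translation of 0.2 at depth 2:
# the principal-point pixel shifts right by fx * tx / d = 100 * 0.2 / 2.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 48.0],
              [0.0, 0.0, 1.0]])
p_t = reproject_pixel(np.array([64.0, 48.0, 1.0]), 2.0,
                      K, np.eye(3), np.array([0.2, 0.0, 0.0]))
print(p_t)  # x shifted by 10 pixels
```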
Optionally, as shown in Fig. 3, the generation module 240 further includes:
an optimization submodule 241, configured to perform optimization processing according to the target semantic map and target feature map corresponding to each pose transformation matrix, the optimization processing including hole filling and distortion correction;
a first generation submodule 242, configured to generate the image frames respectively according to the optimized target semantic map and the optimized target feature map corresponding to each pose transformation matrix.
Optionally, as shown in Fig. 3, the generation module 240 further includes:
a second generation submodule 243, configured to, for the target semantic map and target feature map corresponding to each pose transformation matrix, input the target semantic map and target feature map into the generator network of a generative adversarial network to obtain the image frame corresponding to the pose transformation matrix.
Optionally, the loss function of the generative adversarial network is:

    L = L_GAN(G, D_I) + L_GAN(G, D_V) + λ_F · L_FM(G, D) + λ_W · L_W

where λ_F is a hyperparameter used to control the importance of the feature matching loss L_FM, and λ_W is a hyperparameter. L_GAN(G, D_I) denotes the image loss of the image discriminator network, the loss function of the image discriminator network being:

    L_GAN(G, D_I) = Σ_k L_GAN(G, D_k^I)

L_GAN(G, D_V) denotes the image loss of the video discriminator network, the loss function of the video discriminator network being:

    L_GAN(G, D_V) = Σ_k L_GAN(G, D_k^V)

Also, the function of the feature matching loss is:

    L_FM(G, D_k) = E_(s,x) Σ_{i=1..n} (1/N_i) · ‖ D_k^(i)(s, x) − D_k^(i)(s, G(s)) ‖_1

where G represents the generator network, D represents the discriminator network, D_k represents a multi-scale discriminator network, k represents the index of the multi-scale discriminator network, D_1 and D_2 respectively represent multi-scale discriminator networks at two different scales, D_k^I represents the k-th multi-scale discriminator network in the image discriminator network, D_k^V represents the k-th multi-scale discriminator network in the video discriminator network, s represents the source view, x represents the target view, n represents the number of layers of the perceptron, N_i represents the number of elements in layer i, i indexes the layers, D_k^(i) represents the i-th-layer feature extractor of the multi-scale discriminator network D_k, ‖·‖_1 represents the L1 norm, and GAN stands for generative adversarial network;

L_W represents the optical flow loss, the function of which is:

    L_W = (1/(T−1)) · Σ_{t=1..T−1} ( ‖ w_t − ŵ_t ‖_1 + ‖ x_{t+1} − w̃_t(x_t) ‖_1 )

where T represents the number of images in the sequence, w_t and ŵ_t respectively represent the ground-truth optical flow and the predicted optical flow between frame t and frame t+1 of the image sequence, x_{t+1} represents the image of frame t+1, and w̃_t(x_t) represents the image obtained by mapping the image of frame t to frame t+1 in combination with the optical flow information;

The generative adversarial network is trained by alternately maximizing and minimizing the loss function according to the following formula:

    min_G ( ( max_{D_I} L_GAN(G, D_I) ) + ( max_{D_V} L_GAN(G, D_V) ) + λ_F · L_FM(G, D) + λ_W · L_W )

where L_GAN(G, D) is expressed as:

    L_GAN(G, D) = E_(s,x)[ log D(s, x) ] + E_s[ log(1 − D(s, G(s))) ]
The present disclosure also provides a computer-readable storage medium on which computer program instructions are stored, where the program instructions, when executed by a processor, implement the steps of the method for generating a video provided by the present disclosure.
Fig. 4 is a block diagram of an electronic device 400 according to an exemplary embodiment. As shown in Fig. 4, the electronic device 400 may include a processor 401 and a memory 402. The electronic device 400 may also include one or more of a multimedia component 403, an input/output (I/O) interface 404, and a communication component 405.
The processor 401 is configured to control the overall operation of the electronic device 400 to complete all or part of the steps of the above method for generating a video. The memory 402 is configured to store various types of data to support operation on the electronic device 400; these data may include, for example, instructions of any application or method operated on the electronic device 400 and application-related data, such as contact data, sent and received messages, pictures, audio, video, and so on. The memory 402 may be realized by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 403 may include a screen and an audio component, where the screen may for example be a touch screen, and the audio component is configured to output and/or input audio signals. For example, the audio component may include a microphone configured to receive external audio signals; the received audio signal may be further stored in the memory 402 or sent through the communication component 405. The audio component also includes at least one loudspeaker configured to output audio signals. The I/O interface 404 provides an interface between the processor 401 and other interface modules; the other interface modules may be a keyboard, a mouse, buttons, and so on, and these buttons may be virtual buttons or physical buttons. The communication component 405 is used for wired or wireless communication between the electronic device 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near-field communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, other 5G technologies, or a combination of one or more of them, which is not limited here; accordingly, the communication component 405 may include a Wi-Fi module, a Bluetooth module, an NFC module, and so on.
In an exemplary embodiment, the electronic device 400 may be realized by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the steps of the above method for generating a video.
In a further exemplary embodiment, a computer-readable storage medium including program instructions is also provided, where the steps of the above method for generating a video are realized when the program instructions are executed by a processor. For example, the computer-readable storage medium may be the above memory 402 including program instructions, and the above program instructions may be executed by the processor 401 of the electronic device 400 to complete the above method for generating a video.
Those skilled in the art will readily conceive of other embodiments of the disclosure after considering the specification and practicing the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure; these variations, uses, or adaptations follow the general principles of the disclosure and include common knowledge or customary technical means in the art not disclosed by the disclosure. The specification and examples are to be considered exemplary only, with the true scope and spirit of the disclosure indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures that have been described above and shown in the drawings, and various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
The preferred embodiments of the present disclosure have been described in detail above in conjunction with the drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, a variety of simple variants of the technical solution of the present disclosure may be made, and these simple variants all belong to the protection scope of the present disclosure.
It should further be noted that the specific technical features described in the above specific embodiments may, in the absence of contradiction, be combined in any suitable manner; in order to avoid unnecessary repetition, the present disclosure will not further explain the various possible combinations.
In addition, the various different embodiments of the present disclosure may also be combined arbitrarily, and as long as such combinations do not violate the idea of the present disclosure, they should likewise be regarded as content disclosed by the present disclosure.