CN110381268A

CN110381268A - method, device, storage medium and electronic equipment for generating video

Info

Publication number: CN110381268A
Application number: CN201910557145.4A
Authority: CN
Inventors: 王超鹏; 林义闽; 廉士国
Original assignee: Cloudminds Shenzhen Robotics Systems Co Ltd
Current assignee: Cloudminds Shanghai Robotics Co Ltd
Priority date: 2019-06-25
Filing date: 2019-06-25
Publication date: 2019-10-25
Anticipated expiration: 2039-06-25
Also published as: CN110381268B

Abstract

The disclosure provides a method, a device, a storage medium and an electronic device for generating video. The method comprises the following steps: inputting the three primary color images of the source view into a depth and semantic network to obtain a depth image and a semantic image output by the depth and semantic network; inputting the semantic graph and the three-primary-color image into a feature encoder network to obtain a feature graph output by the feature encoder network; aiming at each pose transformation matrix in a plurality of continuous pose transformation matrixes of a source view, respectively transforming a semantic graph and a feature graph according to the pose transformation matrix and a depth map to obtain a target semantic graph and a target feature graph corresponding to each pose transformation matrix; respectively generating image frames according to the target semantic graph and the target characteristic graph corresponding to each pose transformation matrix to obtain a plurality of continuous image frames; a plurality of successive image frames are synthesized into a video. Therefore, the generated image frame can be clearer and more vivid, and the generated video is more vivid.

Description

The method for generating video, device, storage medium and electronic equipment

Technical field

This disclosure relates to video technique field, and in particular, to a method of generate video, device, storage medium and Electronic equipment.

Background technique

With the development of computer vision technique, the development of camcorder technology is constantly promoted.The view of video camera shooting Frequency quality has an important influence the research of computer vision technique.For example, the video quality of video camera shooting is to visual synchronization In positioning and mapping (visual simultaneous localization and mapping, abbreviation vSLAM) and vision Positioning, navigation and the three-dimensional of journey meter (Visual Odometry, abbreviation VO) are built figure and are had an important influence.

The video of video camera shooting is made of the static image of a sequence, these static images are known as picture frame.And The frame per second of video is exactly that the bitmap images as unit of frame frequency or rate over the display continuously occur.Video camera acquisition figure The frame per second of picture affects the quality of video, such as the frame per second of video camera acquisition image is too low, video can be made not smooth, before making video Frame is discontinuous afterwards.In traditional method, by adjusting the performance of video camera, the frame per second of video camera can be improved, to improve view The quality of frequency.Intermediate frame image can also be generated to improve the frame per second of video by single width frame image or front and back two field pictures, but It is that the existing method for generating intermediate image frame based on single width or multiple image, the video distortion of synthesis is serious, stability Difference.

Summary of the invention

It is a general object of the present disclosure to provide a kind of method for generating video, device, storage medium and electronic equipment, to Solve problems of the prior art.

To achieve the goals above, according to the first aspect of the embodiments of the present disclosure, a kind of method generating video, institute are provided The method of stating includes:

The three-primary-color image of source view is inputted into depth and semantic network, obtains the depth of the depth and semantic network output Degree figure and grapheme；

By the grapheme and the three-primary-color image input feature vector encoder network, the feature coding device net is obtained The characteristic pattern of network output；

For each module and carriage transformation matrix in multiple continuous module and carriage transformation matrixes of the source view, according to the pose Transformation matrix and the depth map, convert the grapheme and the characteristic pattern respectively, obtain corresponding to each described The target grapheme and target signature of module and carriage transformation matrix, the multiple continuous module and carriage transformation matrix are the source view phases For the module and carriage transformation matrix of the difference of multiple continuous picture frames；

According to the target grapheme and target signature for corresponding to each module and carriage transformation matrix, image is generated respectively Frame obtains multiple continuous picture frames, wherein each described image frame and the source view are the figures of same target different perspectives Picture；

The multiple continuous picture frame is synthesized into video.

Optionally, each pose in multiple continuous module and carriage transformation matrixes for the source view converts square Battle array, according to the module and carriage transformation matrix and the depth map, converts the grapheme and the characteristic pattern respectively, comprising:

The pixel is calculated by following formula for each pixel in the characteristic pattern and the grapheme respectively to exist Coordinate in first picture frame:

[p_t]=dK [R | t] K^-1[p_s]

[R | t]=[R_s|t_s]^-1[R_t|t_t]

Wherein, d represents the depth value in the depth map at the pixel, and K represents the internal reference of camera, described in [R | t] is represented Source view is with respect to the module and carriage transformation matrix of the first image frame, and R, which is represented, to be rotated, and t represents translation, [R_s|t_s]、[R_t|t_t] respectively Represent pose of the camera under world coordinate system, p under the source view and the first image frame_sIndicate the pixel in the source Coordinate under view, p_tIndicate the coordinate under first picture frame.

Optionally, the basis corresponds to the target grapheme and target signature of each module and carriage transformation matrix, point It Sheng Cheng picture frame, comprising:

Processing, institute are optimized according to the target grapheme and target signature that correspond to each module and carriage transformation matrix Stating optimization processing includes: holes filling and skew control；

Described according to the target grapheme after the optimization for corresponding to each module and carriage transformation matrix and after optimizing Target signature generates described image frame respectively.

Optionally, the basis corresponds to the target grapheme and target signature of each module and carriage transformation matrix, point Not Sheng Cheng picture frame, obtaining multiple continuous picture frames includes:

For the target grapheme and target signature for corresponding to each module and carriage transformation matrix, by the target grapheme The generator network generated in confrontation network is inputted with the target signature, obtains the figure for corresponding to the module and carriage transformation matrix As frame.

Optionally, the loss function for generating confrontation network are as follows:

Wherein, λ_FFor super ginseng, it to be used for controlling feature match penaltiesSignificance level, λ_WJoin to be super,Indicate that image is sentenced Image impairment in other device network, the loss function of described image arbiter network are as follows:

Indicate the image impairment in video decision device network, the loss function of the video decision device network are as follows:

Also, the function of the characteristic matching loss are as follows:

Wherein G represents generator network, and D represents arbiter network, D_kMultiple dimensioned arbiter network is represented, k represents institute State the number of multiple dimensioned arbiter network, D₁, D₂The multiple dimensioned arbiter network of two different scales is respectively represented,K-th of multiple dimensioned arbiter network in described image arbiter network is represented,Represent the video decision K-th of multiple dimensioned arbiter network in device network, behalf source view, x represent target view, and n represents perceptron The number of plies, N_iThe number of every layer of element is represented,Represent the corresponding multiple dimensioned arbiter network D of i-th layer of feature extractor_k, | | | |₁1 norm is represented, GAN, which is represented, generates confrontation network；

Represent light stream loss, the function of the light stream loss are as follows:

Wherein, the number of T representative image sequence, w_t、It respectively represents in image sequence between t frame and t+1 frame True light stream and prediction light stream, x_t+1The image of t+1 frame is represented,It represents and combines Optic flow information, by x_tThe mapping of frame image To x_t+1The corresponding image of frame；

The training for generating confrontation network is that the loss function is maximized and minimized by following formula Alternating training:

Wherein,It indicates are as follows:

According to the second aspect of an embodiment of the present disclosure, a kind of device for generating video is provided, described device includes:

First obtains module, for the three-primary-color image of source view to be inputted depth and semantic network, obtains the depth And the depth map and grapheme of semantic network output；

Second obtains module, for obtaining the grapheme and the three-primary-color image input feature vector encoder network The characteristic pattern exported to the feature coding device network；

Conversion module, for converting square for each pose in multiple continuous module and carriage transformation matrixes of the source view Battle array, according to the module and carriage transformation matrix and the depth map, converts the grapheme and the characteristic pattern respectively, obtains pair Should be in the target grapheme and target signature of each module and carriage transformation matrix, the multiple continuous module and carriage transformation matrix Module and carriage transformation matrix of the source view relative to the difference of multiple continuous picture frames；

Generation module corresponds to the target grapheme and target signature of each module and carriage transformation matrix for basis, Picture frame is generated respectively, obtains multiple continuous picture frames, wherein each described image frame and the source view are same targets The image of different perspectives；

Synthesis module, for the multiple continuous picture frame to be synthesized video.

Optionally, the conversion module includes:

Computational submodule, each pixel for being directed in the characteristic pattern and the grapheme respectively pass through following public Formula calculates coordinate of the pixel in the first picture frame:

[p_t]=dK [R | t] K^-1[p_s]

[R | t]=[R_s|t_s]^-1[R_t|t_t]

Optionally, the generation module includes:

Optimize submodule, for according to the target grapheme and target signature for corresponding to each module and carriage transformation matrix Processing is optimized, the optimization processing includes: holes filling and skew control；

First generates submodule, for according to the target language after the optimization for corresponding to each module and carriage transformation matrix The target signature after justice figure and optimization, generates described image frame respectively.

Optionally, the generation module further include:

Second generates submodule, for special for the target grapheme and target for corresponding to each module and carriage transformation matrix The target grapheme and target signature input are generated the generator network in confrontation network, obtain the correspondence by sign figure In the picture frame of the module and carriage transformation matrix.

Also, the function of the characteristic matching loss are as follows:

Wherein, the number of T representative image sequence, w_t、It respectively represents in image sequence between t frame and t+1 frame True light stream and prediction light stream, x_t+1The image of t+1 frame is represented,It represents and combines Optic flow information, by x_tFrame image is mapped to x_t+1The corresponding image of frame；

Wherein,It indicates are as follows:

According to the third aspect of an embodiment of the present disclosure, the disclosure additionally provides a kind of computer readable storage medium, thereon Computer program instructions are stored with, the step of disclosure first aspect the method is realized when which is executed by processor Suddenly.

According to a fourth aspect of embodiments of the present disclosure, a kind of electronic equipment is provided, comprising:

Memory is stored thereon with computer program；

Processor, for executing the computer program in the memory, to realize described in disclosure first aspect The step of method.

By adopting the above technical scheme, by corresponding to the continuous pose sequence and depth map of source view, by source view Grapheme and characteristic pattern carry out geometric transformation, continuous multiple target graphemes and continuous multiple mesh can be respectively obtained Characteristic pattern is marked, multiple target grapheme and its corresponding target signature are then synthesized into multiple continuous images respectively again Frame.Again by these continuous picture frame synthetic videos.In this way, the depth map of source view, grapheme and feature are utilized Figure can deduce the three-dimensional structure of invisible area, and keep its true texture, to keep the picture frame generated more clear It is clear and true to nature.In this way, the video generated in this way is more life-like, stability is more preferable.

Other feature and advantage of the disclosure will the following detailed description will be given in the detailed implementation section.

Detailed description of the invention

Attached drawing is and to constitute part of specification for providing further understanding of the disclosure, with following tool Body embodiment is used to explain the disclosure together, but does not constitute the limitation to the disclosure.In the accompanying drawings:

Fig. 1 is a kind of flow chart of method for generating video shown according to an exemplary embodiment.

Fig. 2 is a kind of block diagram of device for generating video shown according to an exemplary embodiment.

Fig. 3 is the block diagram of another device for generating video shown according to an exemplary embodiment.

Fig. 4 is the block diagram of a kind of electronic equipment shown according to an exemplary embodiment.

Specific embodiment

It is described in detail below in conjunction with specific embodiment of the attached drawing to the disclosure.It should be understood that this place is retouched The specific embodiment stated is only used for describing and explaining the disclosure, is not limited to the disclosure.

Example embodiments are described in detail here, and the example is illustrated in the accompanying drawings.Following description is related to When attached drawing, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements.Following exemplary embodiment Described in embodiment do not represent all implementations consistent with this disclosure.On the contrary, they be only with it is such as appended The example of the consistent device and method of some aspects be described in detail in claims, the disclosure.

It should be noted that the specification and claims of the disclosure and term " first " in above-mentioned attached drawing, " Two " etc. be to be used to distinguish similar objects, without being interpreted as specific sequence or precedence.

In order to make those skilled in the art be easier to understand the technical solution of embodiment of the present disclosure offer, first below to this The open related notion being related to simply is introduced.

Computer vision is referred to and is simulated using computer and relevant device to biological vision.Its main task is just It is to be handled by picture to acquisition or video to obtain the three-dimensional information of corresponding scene.Further, just refer to The machine vision for replacing human eye to be identified, tracked to target with video camera and computer and being measured, and graphics process is further done, Computer is set to be treated as the image for being more suitable for eye-observation or sending instrument detection to.

Visual angle refers to the vertical direction angulation of sight and display etc..It specifically refers to, when observing object, from The light of (the upper and lower or left and right) extraction in object both ends at the optical center of human eye formed by angle.

Become multi-view image, refers to for same three-dimensional scenic, by the image of the different perspectives mapped three-dimensional scenic.

Three-primary-color image refers to RGB (Red Green Blue, abbreviation RGB) image, by three Color Channels of red, green, blue The image of composition.

Depth map, depth image are also referred to as range image, refer to the distance of each point in from image acquisition device to scene is (deep Degree) image as pixel value.It has directly reacted the geometry of scenery visible surface in scene, i other words, depth image It is the three dimensional representation form of object.

Grapheme refers to that machine is divided automatically and identifies the content in image.Specifically referring to will figure from pixel scale Different objects as in are split, and are marked the object classification of its representative and positioned and detect the position of the object in the picture It sets.

Characteristic pattern, in each convolutional layer in convolutional neural networks, data are all existing in three dimensions.It can be It regards many a two-dimension pictures as and stacks, and wherein each is known as a characteristic pattern.That is, in convolutional neural networks Each layer in image carry out multiple angles description, refer specifically to carry out volume operation with different convolution collecting images, obtain Response on different core (core can be understood as the description above), the feature as image.I other words characteristic pattern is convolution Core rolls out the result come.

Pose refers to position and posture of the object in image in specified coordinate system, describes the relative position of object With relative motion track.Image under different visual angles has different poses.

Hole refers to image after treatment, region of the pixel without value or extreme value occurs.Such as eight company inside bianry image Pixel point set in the closed circle of logical dot matrix composition.

Bilinear interpolation, also referred to as bilinear interpolation method are using the pixel value of 4 neighbouring points, according to it away from interior The distance for inserting point assigns different weights, carries out linear interpolation.This method have equalization low-pass filter effect, edge by Smoothing effect, and generate the coherent output image of a comparison.

Resampling refers to the process of and goes out another kind of picture dot information according to the message interpolation of a kind of pixel.It is a kind of image number According to processing method.Gray scale processing method i.e. during image data reorganization.Image sampling is to acquire shadow at regular intervals As gray values, when threshold value is not located at the numerical value of the original function on sampled point, it is necessary to using in the progress of sampled point It inserts, referred to as resampling.

Pre-training just refers to the process of a model of training in advance or refers to preparatory training pattern.

Robustness refers to that control system under the Parameter Perturbation of certain (structure, size), maintains the spy of other certain performances Property.

Stability refers to that control system after making it deviate the disappearance of the perturbation action of equilibrium state, returns to original equilibrium-like The ability of state.

Space-time consistency, over time and space consistent feature.

The prediction of light stream, be using the image in image sequence pixel in time-domain variation and consecutive frame between Correlation find previous frame with corresponding relationship existing between present frame, to calculate the movement of object between consecutive frame A kind of method of information.

The embodiment of the present disclosure provides a kind of method for generating video, as shown in Figure 1, this method comprises:

S101, the three-primary-color image of source view is inputted into depth and semantic network, obtains the depth and semantic network is defeated Depth map and grapheme out.

Source view is handled using the good depth of pre-training and semantic network, obtains the corresponding grapheme of source view And depth map.Specifically, semantic segmentation and depth are carried out to source view using the good semantic segmentation of pre-training and depth prediction network Degree prediction, wherein semantic segmentation and depth prediction network can be deep neural network, such as can be convolutional neural networks.

Illustratively, source view is inputted in depth and semantic network, is rolled up by one 3 × 3 convolution collecting image Product processing operation, then the convolutional layer exports the 2-D data of a new image.And it is worth noting that using same big Small different convolution collecting images are handled, and the different characteristic of image can be extracted, for example, profile, color, texture etc.. Then new two-dimensional signal is inputted in next convolutional layer and is handled.After convolutional layer, full articulamentum is entered data into, Full articulamentum output is one-dimensional vector, is in the network in object classification since the one-dimensional vector represents objects in images The probability of object, therefore handled by the semantic segmentation of the network, this is known that according to the one-dimensional vector that full articulamentum exports Object in image respectively represents any object.It can obtain above-mentioned grapheme.For example, if having one in the image of input The people of motorcycle being ride, then, after semantic segmentation, people and Che can be split, and by the area where people in image Domain mark is people, and the area marking where motorcycle is motorcycle.For another example if there are two people in the image of input, wherein one Individual rides motorcycle, and after semantic segmentation, two people regions in image are marked as people, motorcycle region It is labeled as motorcycle.In one possible implementation, the people that ride motorcycle can also be marked as people 1, by another People's mark is people 2.

It is worth noting that source view described in step S101 can be any in the video of video camera shooting one Picture frame is also possible to the single image of video camera shooting.

In addition, source view is inputted in depth and semantic network, the corresponding depth map of the available image.Depth image The depth information of scene in the image is reacted.By obtaining the depth value of each pixel, it is known that each point in the scene To the distance of camera plane, therefore, depth map can directly react the geometry information of scenery visible surface in the scene.And And due to intensive image slices vegetarian refreshments, according to its intensive depth map information, it can speculate that the three-dimensional of invisible area object is believed Breath.

S102, by the grapheme and the three-primary-color image input feature vector encoder network, obtain the feature and compile The characteristic pattern of code device network output.

In order to make the target image frame generated and source view that there is continuity on space-time, in other words, in order to make The image of generation maintains original feature of all scenery in the view of source, for example, shape feature, color characteristic, textural characteristics and The feature that feature coding device network comes in extraction source view can be used in spatial relation characteristics etc..Wherein color feature The surface nature of scenery corresponding to image or image-region；Shape feature is divided into two kinds, and one is contour feature, another kind is Provincial characteristics, the contour feature of image is mainly for the outer boundary of object, and the provincial characteristics of image is then related to entire shape Region；Spatial relation characteristics refer to the mutual spatial position or relative direction between the multiple scenery split in image Relationship, these relationships can also be divided into connection or syntople, overlapping or overlapping relation, include or containment relationship etc..

It is worth noting that can be low-dimensional vector using the feature that feature coding device extracts, it is also possible to high dimension vector, In other words these features can be low-level image feature, be also possible to high-level characteristic, and the disclosure does not limit this.Specifically, lead to Feature coding device is crossed, the characteristics of image of bottom, the i.e. marginal information of low-dimensional vector expression is obtained, then carries out feature combination, obtain To the characteristics of image on upper layer, i.e. the high-level characteristic information of high dimension vector expression.By feature extraction, features described above figure can be made to protect Hold the real features of source images.

Therefore, in step s 102, by by the three-primary-color image input feature vector encoder network of grapheme and source view In, the characteristic pattern of the available three-primary-color image, this feature figure maintains original feature letter of each example in grapheme Breath, wherein example refers to independent individual, and people 1 and people 2 as escribed above can be respectively two examples.

S103, for each module and carriage transformation matrix in multiple continuous module and carriage transformation matrixes of the source view, according to The module and carriage transformation matrix and the depth map, convert the grapheme and the characteristic pattern respectively, obtain corresponding to every The target grapheme and target signature of one module and carriage transformation matrix.

Wherein, the multiple continuous module and carriage transformation matrix is point of the source view relative to multiple continuous picture frames Other module and carriage transformation matrix, in other words, multiple continuous module and carriage transformation matrixes are multiple target views relative to source view Module and carriage transformation matrix, and the continuous module and carriage transformation matrix can be user's input.It is converted according to multiple continuous poses The depth map of each module and carriage transformation matrix and source view in matrix, grapheme and characteristic pattern to source view become respectively It changes, obtains the target grapheme and target signature corresponding to each module and carriage transformation matrix.Also, it is converted according to continuous pose Matrix, which obtains corresponding multiple target graphemes and multiple target signatures, also has continuity.

Specifically, by multiple continuous module and carriage transformation matrixes, the different perspectives of available same three-dimensional scenic it is more A continuous image, it can pass through the available multiple target images of module and carriage transformation matrix.In a kind of possible embodiment In, such as in the case where unknown pose, can be used visual odometry (Visual Odometry, VO) or it is directly sparse in Journey meter (Direct Sparse Odometry, DSO) etc. handles image sequence, to obtain the corresponding pose of each image Data [R | t]=[R | t]₁,[R|t]₂,…,[R|t]_n, wherein [R | t]₁The pose of piece image is represented, [R | t]_nIt represents The pose of n-th width image.

Optionally, for each module and carriage transformation matrix in multiple continuous module and carriage transformation matrixes of the source view, root According to the module and carriage transformation matrix and depth map, grapheme and characteristic pattern are converted respectively, can with the following steps are included:

The pixel is calculated in the first image by following formula for each pixel in characteristic pattern and grapheme respectively Coordinate in frame:

[p_t]=dK [R | t] K^-1[p_s]

[R | t]=[R_s|t_s]^-1[R_t|t_t]

It, can be by the characteristic pattern of source view and grapheme by module and carriage transformation matrix, by source using above-mentioned calculation method Each of the characteristic pattern and grapheme of view pixel is mapped in the first picture frame.Wherein the first picture frame can be Either objective image in above-mentioned multiple target signatures and multiple target graphemes.Therefore, in the pose sequence for knowing image In the case where column, the image under another arbitrary pose can be obtained according to the corresponding image of any one pose.

In one possible implementation, two adjacent poses can also be divided into N equal portions, example according to demand Such as, by pose [R | t]₁With pose [R | t]₂Between pose variation be divided into N equal portions, obtain the new pose data of N-1.Then Pose of the arbitrary pose data as target view in pose data after choosing segmentation.So, then by above-mentioned Calculation method calculates the grapheme and characteristic pattern of arbitrary target view according to the grapheme of source view and characteristic pattern.It is this Method can be used for being inserted into more picture frames between two adjacent picture frames.

In addition, due to the coordinate p converted by module and carriage transformation matrix_tAnd non-integer, and hence it is also possible to use two-wire Property interpolation method in 4 adjacent areas numerical value carry out resampling so that transformed image is more smooth.

S104, basis correspond to the target grapheme and target signature of each module and carriage transformation matrix, generate respectively Picture frame obtains multiple continuous picture frames.

Wherein, each described image frame and the source view are the images of same target different perspectives.I other words described every One described image frame and the source view are the images under the different positions and pose for same three-dimensional scenic.

In step S104, image can be generated according to the corresponding target grapheme of Zhang Yuan's view and target signature Frame can also generate picture frame according to the corresponding target grapheme of multiple source views and target signature.For example, in above-mentioned steps The as described in the examples of S103 uses above-mentioned calculation method, can be used for being inserted between two adjacent picture frames more Picture frame.So, that is to say, that in known two picture frames, and to when being inserted into more images between two picture frames, may be used also To obtain the image under same pose respectively according to known two picture frames.Then the image under two same poses is closed As the image under the pose, that is, target image frame.In this way, by two pose images, it is available to be somebody's turn to do to more The characteristic information of image under same pose.Therefore the image under the object pose obtained can be made more true.To Keep the picture frame generated more life-like.

S105, the multiple continuous picture frame is synthesized into video.

Using the above method, by corresponding to the continuous pose sequence and depth map of source view, by the language of source view Justice figure and characteristic pattern carry out geometric transformation, can respectively obtain continuous multiple target graphemes and continuous multiple targets are special Then multiple target grapheme and its corresponding target signature are synthesized multiple continuous picture frames respectively again by sign figure. These continuous picture frames are synthesized into video again.In this way, the depth map of source view, grapheme and characteristic pattern are utilized The three-dimensional structure of invisible area can be deduced, and keeps its true texture, so that the picture frame generated be made to be more clear With it is true to nature.So, the video synthesized using the picture frame is also more true to nature, improves the stability of video.In addition, using this Kind of method, due to generating multiple continuous picture frames after the image of source view, therefore, this method can be also used for More picture frames are inserted between two continuous picture frames.Such as more images are inserted into two frame of front and back into video Frame.In this way, video bag can be made containing more picture frames, the frame per second of video is improved, the frame per second of video camera is indirectly improved, So as to improve the continuity of video so that video is more smooth.

Optionally, according to the target grapheme and target signature for corresponding to each module and carriage transformation matrix, figure is generated respectively As frame, can with the following steps are included:

Processing is optimized according to the target grapheme and target signature that correspond to each module and carriage transformation matrix, the optimization Processing includes: holes filling and skew control；

According to the target grapheme after the optimization for corresponding to each module and carriage transformation matrix and the target signature after optimization, divide It Sheng Cheng not described image frame.

During converting to obtain target grapheme and target signature by module and carriage transformation matrix, due to being deposited in image In invisible area, i.e., the invisible area formed is blocked by foreground object under current visual angle, under the visual angle of source view not Visibility region may be visible under the visual angle of target view, then, the target grapheme and target signature converted May there are pixel lack part, i.e. hole.To solve this problem, in disclosure the method, optimization can be passed through Network carries out holes filling to target grapheme and target signature.Also, in calculating target grapheme and target signature Each pixel coordinate when, in fact it could happen that error, cause pixel coordinate occur error, to make the scenery in image Generate distortion.Therefore, it in disclosure the method, can be corrected by image of the optimization processing to distortion.Specifically, Processing can be optimized to image using optimization network, wherein the loss function of optimization network are as follows:

Wherein,The whole loss of representing optimized network,Pixel L1 loss is represented,Represent perception loss, λ generation The super ginseng of table.

ForDepth convolutional network (the Very deep for large-scale image identification can be used Convolutional networks for large-scale image recognition, abbreviation VGG network) it mentions respectively The image of generation and the feature of true image are taken, and the L1 loss calculated between the two is mean absolute error, by the loss AsNumerical value.

In this way, target grapheme and target signature are subjected to holes filling and skew control, can made excellent Image after change is more life-like.

In addition, it should also be noted that, the method that can use a variety of suppositions, optimizes target image frame.It is a variety of Speculate the source view progress pose transformation referred to according under multiple different positions and poses, thus it is speculated that obtain the target view under multiple same poses The method of figure.The reason is that, the information of the same three-dimensional scenic by seeing under different perspectives, that is, different positions and pose is different, tool Body, when different perspectives watches same three-dimensional scenic, due to visual angle difference, then blocking caused invisible area by foreground object Domain is different, therefore, can carry out a variety of suppositions to target image by the picture frame of the front and back of target image frame, this Sample can integrate the much information of the image under different positions and pose, carry out more to the information of invisible area in the view of source Accurately speculate, so that the target image sequence generated is more life-like, to keep the video generated more life-like and smooth.

Optionally, according to the target grapheme and target signature for corresponding to each module and carriage transformation matrix, figure is generated respectively As frame, when obtaining multiple continuous picture frames, can with the following steps are included:

For the target grapheme and target signature for corresponding to each module and carriage transformation matrix, by the target grapheme and it is somebody's turn to do Target signature input generates the generator network in confrontation network, obtains the picture frame corresponding to the module and carriage transformation matrix.

It should be appreciated by those skilled in the art that high-resolution and texture figure true to nature can be synthesized by generating confrontation network Picture.Specifically, generating confrontation network includes generator network and arbiter network.The purpose for generating confrontation network is generated with vacation The sample to look genuine, even if the sample that generator network generates is very true, it is true to distinguish to reach the scarce capacity of arbiter network The dummy copy of sample and generation.Wherein, in the disclosure, this sample refers to image.

Optionally, the loss function of confrontation network is generated are as follows:

Also, the function of the characteristic matching loss are as follows:

Wherein G represents generator network, and D represents arbiter network, D_kMultiple dimensioned arbiter network is represented, k represents institute State the number of multiple dimensioned arbiter network, D₁, D₂The multiple dimensioned arbiter network of two different scales is respectively represented,K-th of multiple dimensioned arbiter network in described image arbiter network is represented,Represent the video decision K-th of multiple dimensioned arbiter network in device network, behalf source view, x represent target view, and n represents perceptron The number of plies, N_iThe number of every layer of element is represented, which layer i represents,Represent the corresponding multiple dimensioned differentiation of i-th layer of feature extractor Device network D_k, | | | |₁1 norm is represented, GAN, which is represented, generates confrontation network；

Wherein,It indicates are as follows:

Specifically sentence using multiple dimensioned image it is worth noting that using multiple dimensioned arbiter in this programme Other device D_IWith multiple dimensioned video decision device D_V, facilitate the convergence of network using multiple dimensioned arbiter and accelerate to train, and can To reduce the repetition boxed area on the target image generated.

In addition, the space-time consistency in order to keep generating picture frame, while source view image being sent into and generates confrontation network In, carry out the prediction of light stream, the loss between comparison prediction light stream and true light stream.It should be appreciated by those skilled in the art that can To utilize convolutional network study light stream (Learning Optical Flow with Convolutional Networks, abbreviation FlowNet)。

In this way, the real information of three-dimensional scenic can be kept using the picture frame for generating confrontation network generation, And the space-time expending generated between picture frame and front and back picture frame can also be enhanced using the prediction of light stream.

In conclusion being input with the three-primary-color image of source view, according to source view using method described in the disclosure Depth map, characteristic pattern, grapheme and the module and carriage transformation matrix with target view, generate the image of free-position.And it will generate Image optimize, in conjunction with the prediction of light stream, the Space Consistency for generating image and source view can be kept.It is regarded using source The depth map of figure, grapheme and characteristic pattern can deduce the three-dimensional structure of invisible area, and keep its true texture, from And make generate picture frame be more clear with it is true to nature.In this way, the video generated in this way is more life-like, stability is more It is good.In addition, it is worth noting that, disclosure the method can be applied to vSLAM and build figure, VO positioning and 3D reconstruction etc., this It is open that this is not construed as limiting.For example, the initialization of vSLAM may be will affect if the frame per second of video camera acquisition image is too low, thus Cause vSLAM to build figure to interrupt, it is poor to cause to build figure effect；For another example VO is determined by analysis processing associated image sequences The position of every frame data of video camera shooting and posture facilitate the precision for promoting VO positioning if promoting the frame per second of video camera And stability；For another example vision 3D, which is rebuild, mainly obtains object image data in scene by video camera, and image is divided Analysis processing, in conjunction with computer vision and graphics techniques, rebuilds the threedimensional model of object.If the frame per second of acquisition image improves, It can make the difference very little of adjacent two field pictures, in this way, can help improve the precision of model.Therefore, according to above-mentioned Method, the filling of image interframe data may be implemented, indirectly promoted video camera frame per second, thus increase video continuity and Stability, and then promote precision and robustness that vSLAM, VO, 3D are rebuild.

The embodiment of the present disclosure also provides a kind of device for generating video, for implementing one kind of above method embodiment offer The step of generating the method for video.As shown in Fig. 2, the device 200 includes:

First obtains module 210, for the three-primary-color image of source view to be inputted depth and semantic network, obtains the depth Degree and the depth map and grapheme of semantic network output；

Second obtains module 220, is used for the grapheme and the three-primary-color image input feature vector encoder network, Obtain the characteristic pattern of the feature coding device network output；

Conversion module 230, for becoming for each pose in multiple continuous module and carriage transformation matrixes of the source view Matrix is changed, according to the module and carriage transformation matrix and the depth map, the grapheme and the characteristic pattern is converted respectively, obtained To the target grapheme and target signature for corresponding to each module and carriage transformation matrix, the multiple continuous pose converts square Battle array is module and carriage transformation matrix of the source view relative to the difference of multiple continuous picture frames；

Generation module 240, for according to the target grapheme and target signature for corresponding to each module and carriage transformation matrix Figure, generates picture frame respectively, obtains multiple continuous picture frames, wherein each described image frame and the source view are same The image of object different perspectives；

Synthesis module 250, for the multiple continuous picture frame to be synthesized video.

Using above-mentioned apparatus, by corresponding to the continuous pose sequence and depth map of source view, by the language of source view Justice figure and characteristic pattern carry out geometric transformation, can respectively obtain continuous multiple target graphemes and continuous multiple targets are special Then multiple target grapheme and its corresponding target signature are synthesized multiple continuous picture frames respectively again by sign figure. These continuous picture frames are synthesized into video again.Using this device, using the depth map of source view, grapheme and characteristic pattern The three-dimensional structure of invisible area can be deduced, and keeps its true texture, so that the picture frame generated be made to be more clear With it is true to nature.In this way, more life-like using the video that this device generates, stability is more preferable.In addition, due to the image in source view After frame, generate multiple continuous picture frames, therefore use this device, can be also used for into video two it is continuous or do not connect More picture frames are inserted between continuous picture frame.In this way, the video generated can be made to include more picture frames, view is improved The frame per second of frequency indirectly improves the frame per second of video camera, so as to improve the continuity of video so that video is more smooth.

Optionally, as shown in figure 3, conversion module 230 further include:

Computational submodule 231, each pixel for being directed in the characteristic pattern and the grapheme respectively is by such as Lower formula calculates coordinate of the pixel in the first picture frame:

[p_t]=dK [R | t] K^-1[p_s]

[R | t]=[R_s|t_s]^-1[R_t|t_t]

Optionally, as shown in figure 3, generation module 240 further include:

Optimize submodule 241, for special according to the target grapheme and target that correspond to each module and carriage transformation matrix Sign figure optimizes processing, and the optimization processing includes: holes filling and skew control；

First generates submodule 242, for according to the mesh after the optimization for corresponding to each module and carriage transformation matrix The target signature after marking grapheme and optimization, generates described image frame respectively.

Optionally, as shown in figure 3, generation module 240 further include:

Second generates submodule 243, for for the target grapheme and mesh corresponding to each module and carriage transformation matrix Characteristic pattern is marked, the target grapheme and target signature input are generated into the generator network in confrontation network, obtained described Picture frame corresponding to the module and carriage transformation matrix.

Also, the function of the characteristic matching loss are as follows:

Wherein,It indicates are as follows:

The disclosure additionally provides a kind of computer readable storage medium, is stored thereon with computer program instructions, the program The step of a kind of method for generation video that the disclosure provides is realized when instruction is executed by processor.

Fig. 4 is the block diagram of a kind of electronic equipment 400 shown according to an exemplary embodiment.As shown in figure 4, the electronics is set Standby 400 may include: processor 401, memory 402.The electronic equipment 400 can also include multimedia component 403, input/ Export one or more of (I/O) interface 404 and communication component 405.

Wherein, processor 401 is used to control the integrated operation of the electronic equipment 400, to complete a kind of above-mentioned generation view All or part of the steps in the method for frequency.Memory 402 is for storing various types of data to support in the electronic equipment 400 operation, these data for example may include any application or method for operating on the electronic equipment 400 Instruction and the relevant data of application program, such as contact data, the message of transmitting-receiving, picture, audio, video etc..This is deposited Reservoir 402 can realize by any kind of volatibility or non-volatile memory device or their combination, for example, it is static with Machine accesses memory (Static Random Access Memory, abbreviation SRAM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, abbreviation EEPROM), erasable programmable Read-only memory (Erasable Programmable Read-Only Memory, abbreviation EPROM), programmable read only memory (Programmable Read-Only Memory, abbreviation PROM), and read-only memory (Read-Only Memory, referred to as ROM), magnetic memory, flash memory, disk or CD.Multimedia component 403 may include screen and audio component.Wherein Screen for example can be touch screen, and audio component is used for output and/or input audio signal.For example, audio component may include One microphone, microphone is for receiving external audio signal.The received audio signal can be further stored in storage Device 402 is sent by communication component 405.Audio component further includes at least one loudspeaker, is used for output audio signal.I/O Interface 404 provides interface between processor 401 and other interface modules, other above-mentioned interface modules can be keyboard, mouse, Button etc..These buttons can be virtual push button or entity button.Communication component 405 is for the electronic equipment 400 and other Wired or wireless communication is carried out between equipment.Wireless communication, such as Wi-Fi, bluetooth, near-field communication (Near Field Communication, abbreviation NFC), 2G, 3G, 4G, NB-IOT, eMTC or other 5G etc. or they one or more of Combination, it is not limited here.Therefore the corresponding communication component 405 may include: Wi-Fi module, bluetooth module, NFC mould Block etc..

In one exemplary embodiment, electronic equipment 400 can be by one or more application specific integrated circuit (Application Specific Integrated Circuit, abbreviation ASIC), digital signal processor (Digital Signal Processor, abbreviation DSP), digital signal processing appts (Digital Signal Processing Device, Abbreviation DSPD), programmable logic device (Programmable Logic Device, abbreviation PLD), field programmable gate array (Field Programmable Gate Array, abbreviation FPGA), controller, microcontroller, microprocessor or other electronics member Part is realized, for the step of executing a kind of method of above-mentioned generation video.

In a further exemplary embodiment, a kind of computer readable storage medium including program instruction is additionally provided, it should The step of a kind of method of above-mentioned generation video is realized when program instruction is executed by processor.For example, this computer-readable is deposited Storage media can be the above-mentioned memory 402 including program instruction, and above procedure instruction can be by the processor of electronic equipment 400 401 methods executed to complete a kind of above-mentioned generation video.

Those skilled in the art will readily occur to other embodiment party of the disclosure after considering specification and practicing the disclosure Case.This application is intended to cover any variations, uses, or adaptations of the disclosure, these modifications, purposes or adaptability Variation follows the general principles of this disclosure and including the undocumented common knowledge or usual skill in the art of the disclosure Art means.The description and examples are only to be considered as illustrative, and the true scope and spirit of the disclosure are by following claim It points out.

It should be understood that the present disclosure is not limited to the precise structures that have been described above and shown in the drawings, and And various modifications and changes may be made without departing from the scope thereof.The scope of the present disclosure is only limited by the accompanying claims.

The preferred embodiment of the disclosure is described in detail in conjunction with attached drawing above, still, the disclosure is not limited to above-mentioned reality The detail in mode is applied, in the range of the technology design of the disclosure, a variety of letters can be carried out to the technical solution of the disclosure Monotropic type, these simple variants belong to the protection scope of the disclosure.

It is further to note that specific technical features described in the above specific embodiments, in not lance In the case where shield, can be combined in any appropriate way, in order to avoid unnecessary repetition, the disclosure to it is various can No further explanation will be given for the combination of energy.

In addition, any combination can also be carried out between a variety of different embodiments of the disclosure, as long as it is without prejudice to originally Disclosed thought equally should be considered as disclosure disclosure of that.

Claims

1. a kind of method for generating video, which is characterized in that the described method includes:

The three-primary-color image of source view is inputted into depth and semantic network, obtains the depth map of the depth and semantic network output And grapheme；

By the grapheme and the three-primary-color image input feature vector encoder network, it is defeated to obtain the feature coding device network Characteristic pattern out；

For each module and carriage transformation matrix in multiple continuous module and carriage transformation matrixes of the source view, converted according to the pose Matrix and the depth map, convert the grapheme and the characteristic pattern respectively, obtain corresponding to each pose The target grapheme and target signature of transformation matrix, the multiple continuous module and carriage transformation matrix be the source view relative to The module and carriage transformation matrix of the difference of multiple continuous picture frames；

According to the target grapheme and target signature for corresponding to each module and carriage transformation matrix, picture frame is generated respectively, is obtained To multiple continuous picture frames, wherein each described image frame and the source view are the images of same target different perspectives；

The multiple continuous picture frame is synthesized into video.

2. the method according to claim 1, wherein multiple continuous poses for the source view become Each module and carriage transformation matrix in matrix is changed, according to the module and carriage transformation matrix and the depth map, to the grapheme and described Characteristic pattern is converted respectively, comprising:

The pixel is calculated first by following formula for each pixel in the characteristic pattern and the grapheme respectively Coordinate in picture frame:

[p_t]=dK [R | t] K^-1[p_s]

[R | t]=[R_s|t_s]^-1[R_t|t_t]

Wherein, d represents the depth value in the depth map at the pixel, and K represents the internal reference of camera, [R | t] represent the source view Scheme the module and carriage transformation matrix of opposite the first image frame, R represents rotation, and t represents translation, [R_s|t_s]、[R_t|t_t] respectively represent Pose of the camera under world coordinate system, p under the source view and the first image frame_sIndicate the pixel in the source view Under coordinate, p_tIndicate the coordinate under first picture frame.

3. the method according to claim 1, wherein the basis corresponds to each module and carriage transformation matrix Target grapheme and target signature, generate picture frame respectively, comprising:

Processing is optimized according to the target grapheme and target signature that correspond to each module and carriage transformation matrix, it is described excellent Change processing includes: holes filling and skew control；

According to the target grapheme after the optimization for corresponding to each module and carriage transformation matrix and the target after optimization Characteristic pattern generates described image frame respectively.

4. the method according to claim 1, wherein the basis corresponds to each module and carriage transformation matrix Target grapheme and target signature, generate picture frame respectively, obtain multiple continuous picture frames and include:

For the target grapheme and target signature for corresponding to each module and carriage transformation matrix, by the target grapheme and it is somebody's turn to do Target signature input generates the generator network in confrontation network, obtains the image for corresponding to the module and carriage transformation matrix Frame.

5. according to the method described in claim 4, it is characterized in that, the loss function for generating confrontation network are as follows:

Wherein, λ_FFor super ginseng, it to be used for controlling feature match penaltiesSignificance level, λ_WJoin to be super,Indicate image discriminating device Image impairment in network, the loss function of described image arbiter network are as follows:

Also, the function of the characteristic matching loss are as follows:

Wherein G represents generator network, and D represents arbiter network, D_kMultiple dimensioned arbiter network is represented, k represents more rulers The number of the arbiter network of degree, D₁, D₂The multiple dimensioned arbiter network of two different scales is respectively represented,Generation K-th of multiple dimensioned arbiter network in table described image arbiter network,Represent the video decision device net K-th of multiple dimensioned arbiter network in network, behalf source view, x represent target view, and n represents the layer of perceptron Number, N_iThe number of every layer of element is represented,Represent the corresponding multiple dimensioned arbiter network D of i-th layer of feature extractor_k, | | | |₁ 1 norm is represented, GAN, which is represented, generates confrontation network；

Wherein, the number of T representative image sequence, w_t、It respectively represents true between t frame and t+1 frame in image sequence Light stream and prediction light stream, x_t+1The image of t+1 frame is represented,It represents and combines Optic flow information, by x_tFrame image is mapped to x_t+1 The corresponding image of frame；

The training for generating confrontation network is the friendship that the loss function is maximized and minimized by following formula For training:

Wherein,It indicates are as follows:

6. a kind of device for generating video, which is characterized in that described device includes:

First obtains module, for the three-primary-color image of source view to be inputted depth and semantic network, obtains the depth and language The depth map and grapheme of adopted network output；

Second obtains module, for obtaining institute for the grapheme and the three-primary-color image input feature vector encoder network State the characteristic pattern of feature coding device network output；

Conversion module, each module and carriage transformation matrix in multiple continuous module and carriage transformation matrixes for being directed to the source view, According to the module and carriage transformation matrix and the depth map, the grapheme and the characteristic pattern are converted respectively, corresponded to In the target grapheme and target signature of each module and carriage transformation matrix, the multiple continuous module and carriage transformation matrix is institute State module and carriage transformation matrix of the source view relative to the difference of multiple continuous picture frames；

Generation module, for according to the target grapheme and target signature for corresponding to each module and carriage transformation matrix, difference Picture frame is generated, multiple continuous picture frames are obtained, wherein each described image frame and the source view are same target differences The image at visual angle；

7. device according to claim 6, which is characterized in that the conversion module includes:

Computational submodule, each pixel by being directed in the characteristic pattern and the grapheme respectively pass through based on following formula Calculate coordinate of the pixel in the first picture frame:

[p_t]=dK [R | t] K^-1[p_s]

[R | t]=[R_s|t_s]^-1[R_t|t_t]

Wherein, d represents the depth value in the depth map at the pixel, and K represents the internal reference of camera, [R | t] represent the source view Scheme the module and carriage transformation matrix of opposite the first image frame, R represents rotation, and t represents translation, [R_s|t_s], [R_t|t_t] respectively Represent pose of the camera under world coordinate system, p under the source view and the first image frame_sIndicate the pixel in the source Coordinate under view, p_tIndicate the coordinate under first picture frame.

8. device according to claim 6, which is characterized in that the generation module includes:

Optimize submodule, for carrying out according to the target grapheme and target signature that correspond to each module and carriage transformation matrix Optimization processing, the optimization processing include: holes filling and skew control；

First generates submodule, for according to the target grapheme after the optimization for corresponding to each module and carriage transformation matrix With the target signature after optimization, described image frame is generated respectively.

9. device according to claim 6, which is characterized in that the generation module further include:

Second generates submodule, for for the target grapheme and target signature corresponding to each module and carriage transformation matrix The target grapheme and target signature input are generated the generator network in confrontation network, obtain described correspond to by figure The picture frame of the module and carriage transformation matrix.

10. device according to claim 9, which is characterized in that the loss function for generating confrontation network are as follows:

Also, the function of the characteristic matching loss are as follows:

Wherein,It indicates are as follows:

11. a kind of computer readable storage medium, is stored thereon with computer program instructions, which is characterized in that the program instruction The step of any one of claim 1-5 the method is realized when being executed by processor.

12. a kind of electronic equipment characterized by comprising

Memory is stored thereon with computer program；

Processor, for executing the computer program in the memory, to realize described in any one of claim 1-5 The step of method.