CN118200463A - Video generation method and electronic equipment

Video generation method and electronic equipment

Info

Publication number: CN118200463A
Application number: CN202410267683.0A
Authority: CN (China)
Prior art keywords: image, video, style, frame, key frame
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 杨宇辰, 刘国祥
Current assignee: Vivo Mobile Communication Co Ltd
Original assignee: Vivo Mobile Communication Co Ltd
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN202410267683.0A
Publication of CN118200463A

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application discloses a video generation method and an electronic device, belonging to the technical field of video processing. An embodiment of the method comprises: extracting at least one key frame of a first video; performing style rendering on the key frame through an image style rendering model to obtain a first image corresponding to the key frame; performing style diffusion, based on the style of the first image, on adjacent frames of the key frame in the first video to obtain a second image corresponding to the adjacent frames; and generating a second video based on the first image and the second image.

Description

Video generation method and electronic equipment
Technical Field
Embodiments of the application relate to the technical field of video processing, and in particular to a video generation method and an electronic device.
Background
With the development and popularization of video technology on terminal devices, users' demand for rich media resources is also increasing. Authors of video content often wish to process it artistically, for example by applying a particular style to the video to change the style of its content.
In the prior art, a video is usually style-converted frame by frame through a fixed image style rendering algorithm. This is too time-consuming, and abrupt style changes between frames easily occur, so the fluency of the video style conversion is poor.
Disclosure of Invention
The embodiment of the application aims to provide a video generation method and electronic equipment, which can solve the technical problems of low video style conversion efficiency and poor fluency.
In a first aspect, an embodiment of the present application provides a video generating method, including: extracting at least one key frame of the first video; performing style rendering on the key frames through an image style rendering model to obtain first images corresponding to the key frames; based on the style of the first image, performing style diffusion on adjacent frames of the key frames in the first video to obtain a second image corresponding to the adjacent frames; a second video is generated based on the first image and the second image.
In a second aspect, an embodiment of the present application provides a video generating apparatus, including: an extracting unit for extracting at least one key frame of the first video; the style rendering unit is used for performing style rendering on the key frames through the image style rendering model to obtain first images corresponding to the key frames; the style diffusion unit is used for performing style diffusion on adjacent frames of the key frames in the first video based on the style of the first image to obtain a second image corresponding to the adjacent frames; and the generation unit is used for generating a second video based on the first image and the second image.
In a third aspect, an embodiment of the present application provides an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described in the first aspect above.
In a fifth aspect, an embodiment of the present application provides a chip, the chip including a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to execute a program or instructions to implement a method as described in the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, at least one key frame extracted from a first video is subjected to style rendering through an image style rendering model to obtain a first image corresponding to the key frame, then adjacent frames of the key frame in the first video are subjected to style diffusion based on the key frame and the first image to obtain a second image corresponding to the adjacent frames, and finally the second video is generated based on the first image and the second image, so that the second video after style conversion of each frame in the first video can be obtained. Because only the style rendering model is used for performing style rendering on the key frames of the first video, the problem that the frame-by-frame style rendering consumes too much time is solved, and the video style conversion efficiency is improved. Because the style rendering result of the adjacent frame is obtained based on the style rendering result of the key frame, the fluency of video style rendering is improved.
Drawings
Fig. 1 is a flowchart of a video generating method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a key frame extraction process according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a keyframe style rendering process according to an embodiment of the present application;
FIG. 4 is a flow chart of a neighbor frame style diffusion process of an embodiment of the present application;
FIG. 5 is a schematic illustration of the processing of an image segmentation model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a neighbor frame style diffusion process according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an optical flow based information processing process according to an embodiment of the application;
Fig. 8 is a first schematic diagram of an application scenario of a video generating method according to an embodiment of the present application;
FIG. 9 is a second schematic diagram of an application scenario of a video generating method according to an embodiment of the present application;
FIG. 10 is a third schematic diagram of an application scenario of the video generating method according to the embodiment of the present application;
Fig. 11 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 13 is a schematic diagram of a hardware configuration of an electronic device suitable for use in implementing an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type, and are not limited to the number of objects, such as the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
The video generating method and the device provided by the embodiment of the application are described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, one of flowcharts of a video generating method according to an embodiment of the present application is shown. The video generation method provided by the embodiment of the application can be applied to electronic equipment. In practice, the electronic device may be an electronic device such as a smart phone, a tablet computer, a laptop, a smart screen, or VR (Virtual Reality) glasses.
The video generation method provided by the embodiment of the application comprises the following steps:
at step 101, at least one key frame of a first video is extracted.
In this embodiment, the first video may be a video to be style-rendered. The embodiment of the application does not limit the type, the content and the format of the first video.
In this embodiment, the first video may include a plurality of frames. In practice, video may be described in terms of frames (frames). A frame is the smallest visual unit that makes up a video. Each frame is a static image. A sequence of temporally successive frames is synthesized together to form a dynamic video.
In this embodiment, at least one key frame of the first video may be extracted according to a preset extraction rule. As an example, a decisive frame in the first video (for example, a frame containing a key action in the motion change of a character or object) may be taken as a key frame. As yet another example, key frames in the first video may be extracted at fixed or non-fixed intervals.
In some alternative implementations of the present embodiment, key frames in the first video may be extracted at fixed intervals. Specifically, the sampling interval of the key frame may be first determined based on the frame rate of the first video. And extracting the key frames of the first video according to the sampling interval. Referring to fig. 2, the sampling interval may be denoted as N, i.e., one frame is selected as a key frame every N frames. The sampling interval N may be user-defined. As an example, n=5 may be set.
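For illustration only, the sketch below shows one minimal way to realize the fixed-interval extraction just described; the use of OpenCV for decoding and the function name are assumptions and are not part of the patent text.
```python
# Illustrative sketch only: fixed-interval key frame extraction.
# The OpenCV decoding path and the default N = 5 are assumptions, not taken from the patent.
import cv2

def extract_key_frames(video_path: str, n: int = 5):
    """Return (key_frames, all_frames): every n-th frame is treated as a key frame."""
    cap = cv2.VideoCapture(video_path)
    all_frames, key_frames = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        all_frames.append(frame)
        if idx % n == 0:          # one key frame every N frames
            key_frames.append((idx, frame))
        idx += 1
    cap.release()
    return key_frames, all_frames
```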
By extracting the key frames of the first video at fixed sampling intervals and performing style rendering only on them, the problem that frame-by-frame style rendering takes too long can be avoided, the video style rendering processing time is shortened, and the video style rendering efficiency is improved.
In some optional implementations of the present embodiment, the sampling interval of the key frame may be determined based on an input frame rate, an output frame rate, and preset parameters of the first video. And then, extracting the key frames of the first video according to the sampling interval. See the following formula:
where α and λ are adjustable algorithm parameters that can be preset, fr_in denotes the input frame rate, fr_out denotes the output frame rate, ⌈·⌉ denotes rounding up, and ⌊·⌋ denotes rounding down.
The setting of the sampling interval can ensure that the video style rendering effect is not mutated as far as possible, and the fluency of the video after style rendering is ensured.
And 102, performing style rendering on the key frame through an image style rendering model to obtain a first image corresponding to the key frame.
In this embodiment, the image style rendering model may be stored in the electronic device in advance. For each extracted key frame, the key frame can be input into the image style rendering model to obtain the first image corresponding to the key frame. The image style rendering model is used to perform style rendering on an image. It can be trained in advance using a machine learning method, and may be built on a generative model, i.e., a model that processes an input image and outputs a processed image. Examples include, but are not limited to: a generative adversarial network (Generative Adversarial Network, GAN), a variational auto-encoder (Variational Auto-Encoder, VAE), a sequence generation model (Sequence Generation Model, SGM), a denoising diffusion probability model (Denoising Diffusion Probabilistic Model, DDPM), an auto-encoder (Auto-Encoder), and the like.
In some optional implementations of this embodiment, the at least one key frame may include a first key frame and a second key frame, and rendering styles of the first key frame and the second key frame may be different. In practice, the style rendering model can output images with style diversity by adjusting the input parameters of the style rendering model. Thus, the richness of the video style can be improved.
In some alternative implementations of the present embodiment, the following steps may be performed: first, description information is acquired. And then, inputting the description information and the key frames into an image style rendering model to obtain a first image corresponding to the key frames. Wherein the description information may be text information for describing the style of the image. By adding the description information, the rendering pictures in different styles can be generated, so that the image style rendering model has stronger generalization capability.
In some optional implementations of the present embodiment, the network structure of the image style rendering model may include a Contrastive Language-Image Pre-training (CLIP) model, a control network (ControlNet), and the generative model described above. The description information may be converted into text features by the contrastive language-image pre-training model and used as an input of the generative model to guide it to generate the corresponding image. The control network may be used to control the manner in which the generative model generates images. The control network is a neural network architecture that can be used to set or constrain the style of the images generated by the generative model, which expands the capability of the image generation model and greatly improves the controllability of the generative model.
As an example, referring to fig. 3, the generative model used here may be the denoising diffusion probability model described above. The denoising diffusion probability model is stable in training and can generate diverse samples. It comprises two processes: a forward diffusion process and a reverse generation process. During the forward diffusion process, multiple iterations may be performed on the input first image, with Gaussian noise gradually added on the basis of the first image at each step. During the reverse generation process, the noise is inverted and Gaussian noise is gradually subtracted at each step, so that the second image can be generated.
The description information may include "Castle interior", "one-piece dress", "sharp chin", "sunlight reflection", and the like. The description information can be converted into text features by the contrastive language-image pre-training model and used as one input of the denoising diffusion probability model, thereby guiding the denoising diffusion probability model to generate the corresponding image. The control network adds additional conditions to the denoising diffusion probability model in order to control it. The control network divides the network structure into a trainable part and an untrainable part: the trainable part learns the controllable conditions, while the untrainable part retains the original parameters of the denoising diffusion probability model. In this way, guided by a small amount of data, the denoising diffusion probability model fully learns the aforementioned constraints while retaining its own learning capability.
Compared with a traditional image style rendering model, the image style rendering model trained using the denoising diffusion probability model additionally uses the contrastive language-image pre-training model and the control network as guidance, so rendered pictures of different styles can be generated and the image style rendering model has stronger generalization capability.
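For illustration only, the following sketch shows how a ControlNet-guided diffusion pipeline of the kind described above could be assembled with the open-source Hugging Face diffusers library; the library choice, the model identifiers, and the parameter values are assumptions and are not specified in the patent.
```python
# Illustrative sketch only: key-frame style rendering with a ControlNet-guided
# diffusion model via the diffusers library (library and model IDs are assumptions).
import cv2
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny",
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

def render_key_frame(frame_bgr, description: str) -> Image.Image:
    """Style-render one key frame, guided by text description and an edge control image."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    image = Image.fromarray(rgb)
    edges = cv2.Canny(rgb, 100, 200)                      # control condition
    control = Image.fromarray(cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB))
    result = pipe(prompt=description, image=image, control_image=control,
                  strength=0.6, num_inference_steps=20).images[0]
    return result
```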
In some optional implementations of this embodiment, because the image style rendering model has a large number of parameters, low speed and high computational complexity, and insufficient memory may cause performance loss, the model may be compressed in order to deploy it on a terminal device, adapting it to the low-power terminal platform to complete a lightweight on-device deployment. For example, model compression may be performed through model pruning, parameter quantization, low-rank decomposition, neural architecture search (Neural Architecture Search, NAS), knowledge distillation (Knowledge Distillation, KD), and the like. Taking NAS as an example, its basic principle is to search, within a given set of candidate neural network structures, for an optimal network structure, using a sample training set and performance evaluation indicators as constraints. NAS includes three modules: the search space, the search strategy for sampling networks, and the performance evaluation of the sampled networks. The search space is divided into a global search space and a local search space; the local search space combines local search structures into a complete network structure by stacking and splicing. The search strategy may be a reinforcement-learning-based method, an evolutionary-algorithm-based method, or a gradient-based method. Performance may be evaluated by accuracy, speed, and the like.
And 103, performing style diffusion on adjacent frames of the key frames in the first video based on the styles of the first images to obtain second images corresponding to the adjacent frames.
In this embodiment, each key frame corresponds to at least one neighboring frame. For each key frame, the electronic device may perform style diffusion on the adjacent frame based on the style of the first image corresponding to the key frame, to obtain the second image corresponding to the adjacent frame.
In some alternative implementations of the present embodiment, the electronic device may first determine the adjacent frames of the key frame based on the sampling interval of the key frame. Specifically, if the sampling interval is N, the adjacent frames of the key frame may be the N-1 frames before and the N-1 frames after the key frame. For example, if N=5, the adjacent frames of the key frame may include the 4 frames before and the 4 frames after the key frame. Then, style rendering is performed on the adjacent frames based on the first image corresponding to the key frame to obtain the second images corresponding to the adjacent frames. In this way, it can be guaranteed that all frames in the first video are style-rendered without every frame having to be input into the image style rendering model, which improves the video style rendering efficiency.
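As a small illustrative sketch (not from the patent text), the adjacent-frame indices for a key frame could be computed as follows, assuming clipping at the video boundaries:
```python
# Illustrative sketch only: for a key frame at index key_idx and sampling interval n,
# its adjacent frames are the n-1 frames before and the n-1 frames after it.
def neighbor_indices(key_idx: int, n: int, total_frames: int):
    lo = max(0, key_idx - (n - 1))
    hi = min(total_frames - 1, key_idx + (n - 1))
    return [i for i in range(lo, hi + 1) if i != key_idx]
```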
In some alternative implementations of the present embodiment, referring to fig. 4, the following sub-steps may be performed:
In a substep S11, optical flow information between the key frame and the adjacent frame of the key frame is acquired.
Here, the optical flow information describes the amount of movement of the pixels of the same object between video images, and the optical flow information between the key frame and the adjacent frame can be obtained through various optical flow algorithms, including but not limited to the Lucas-Kanade (LK) optical flow algorithm, FlowNet (an optical flow estimation model), FlowNet 2.0 (the second version of the optical flow estimation model), DeepFlow (an optical flow algorithm), and the like.
Taking the Lucas-Kanade optical flow algorithm as an example, its basic principle is to calculate, in a two-dimensional picture, the motion vector of a pixel between two successive frames. For example, a pixel located at position (x, y) at time t is located at position (x+u, y+v) at time t+Δt. Assuming that the brightness of the pixel is unchanged between the two frames, I(x, y, t) = I(x+u, y+v, t+Δt), and optical flow estimation is used to estimate the motion vector (u, v) between the same pixels of the two frames. The Lucas-Kanade optical flow algorithm performs a Taylor expansion of the above equation and, under the assumption that optical flow motion within a neighborhood is similar, converts the calculation of (u, v) into a least-squares problem.
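For illustration only, the sketch below obtains a dense per-pixel motion field between the key frame and an adjacent frame with OpenCV; the Farneback algorithm is used here as a stand-in (an assumption), since the later steps need a motion vector (u, v) for every pixel, while the patent lists Lucas-Kanade, FlowNet, and others as possible choices.
```python
# Illustrative sketch only: dense optical flow between a key frame and an adjacent frame.
# OpenCV's Farneback algorithm is an assumption; the patent names Lucas-Kanade, FlowNet, etc.
import cv2

def dense_flow(key_frame, adjacent_frame):
    """Return an H x W x 2 array of per-pixel motion vectors (u, v)."""
    g0 = cv2.cvtColor(key_frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(adjacent_frame, cv2.COLOR_BGR2GRAY)
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    return cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```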
In a substep S12, a first feature map of the key frame is extracted.
Here, the image feature extraction may be performed on the key frame by the feature extraction model, to obtain the first feature map. The number of first feature maps may be one or more, and is not limited herein.
In some optional implementations of the present embodiment, referring to fig. 5, the key frame may be input to an edge detection model, an image segmentation model, and a contour detection model, to obtain a first edge image, a first segmentation mask image, and a first contour image corresponding to the key frame. The first edge image, the first segmentation mask image, and the first contour image are feature images.
The image segmentation model may be used to calculate segmentation mask information for an object in the image. The image segmentation model may be, but is not limited to, the Segment Anything Model (SAM). Taking the Segment Anything Model as an example, the key frame first passes through the image encoding network in the model to generate a feature vector, the image feature vector is input to a mask decoder, and a segmentation mask picture is finally generated. Alternatively, referring to fig. 6, a prompt word may also be input to the model; the prompt word passes through a prompt encoder to generate a corresponding text feature vector. The picture feature vector and the (optional) text feature vector together serve as the mask decoder input, finally generating the segmentation mask picture.
The edge detection model can be used to extract edge features in the image. The edge detection algorithm used by the edge detection model may be, but is not limited to, the Canny edge detection algorithm. The Canny edge detection algorithm comprises five steps: Gaussian filtering, pixel gradient calculation, non-maximum suppression, thresholding, and weak edge linking. Gaussian filtering is used for noise reduction, reducing the influence of noise on the accuracy of edge detection. Pixel gradient calculation and non-maximum suppression are used for preliminary edge determination. Thresholding employs two thresholds to determine the true and potential edge pixels. Weak edge linking determines the final edge pixels based on the connectivity of the surrounding edge pixels (pixels above the first, higher threshold are detected as edges, and pixels below the second, lower threshold are detected as non-edges).
The contour detection model can be used to extract the thick-line edge information of objects in the picture. The contour detection model is similar to the edge detection model; a feathering function can be added on the basis of the edge detection model to extract soft edges, so that the details are richer and the sense of contour is stronger.
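For illustration only, a rough sketch of extracting the three key-frame feature maps is given below. The edge map uses Canny as in the text; the segmentation mask is produced by a simple Otsu threshold as a placeholder for a real Segment Anything Model, and the contour map approximates the described feathered soft edges — these simplifications are assumptions made for brevity.
```python
# Illustrative sketch only: the three feature maps of a key frame.
# The Otsu-threshold mask and the blurred-Canny "contour" are placeholders (assumptions).
import cv2

def key_frame_feature_maps(key_frame):
    gray = cv2.cvtColor(key_frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)                 # noise reduction
    edge_map = cv2.Canny(blurred, 100, 200)                     # first edge image
    _, mask_map = cv2.threshold(blurred, 0, 255,                # stand-in for a SAM mask
                                cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    soft = cv2.Canny(blurred, 50, 120)
    contour_map = cv2.GaussianBlur(soft, (7, 7), 0)             # feathered soft edges
    return edge_map, mask_map, contour_map
```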
It should be noted that the first feature map is not limited to the above list, but may include other types of feature maps, which are not limited herein.
In a substep S13, a second feature map of the adjacent frame is generated based on the optical flow information and the first feature map.
Here, the pixels in the first feature map may be moved based on the optical flow information to obtain the second feature map of the adjacent frame. In practice, the first feature map may be split into a plurality of first feature image blocks. For example, the first feature map may be split at fixed intervals according to the picture size (length and width). After splitting, each first feature image block is processed using the optical flow information, and the corresponding second feature image block is obtained by prediction. The second feature image blocks are then merged to obtain the second feature map. By splitting into feature image blocks, the processing can be performed with multiple threads, so that the second feature map is generated more efficiently.
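For illustration only, the sketch below splits a key-frame feature map into blocks, moves each block's pixels to the adjacent frame according to the dense optical flow, and merges the results; the forward "splatting" by rounded flow vectors and the thread-pool layout are assumptions about how the movement described above could be realized.
```python
# Illustrative sketch only: block-wise, multi-threaded warping of a feature map
# with per-pixel optical flow (forward splatting is an assumption).
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def warp_feature_map(feature_map, flow, block_rows=4, block_cols=4):
    h, w = feature_map.shape[:2]
    out = np.zeros_like(feature_map)

    def warp_block(r0, r1, c0, c1):
        for y in range(r0, r1):
            for x in range(c0, c1):
                u, v = flow[y, x]
                nx, ny = int(round(x + u)), int(round(y + v))
                if 0 <= nx < w and 0 <= ny < h:
                    out[ny, nx] = feature_map[y, x]      # move the pixel along its flow vector

    with ThreadPoolExecutor() as pool:
        futures = []
        for r in range(block_rows):
            for c in range(block_cols):
                futures.append(pool.submit(
                    warp_block,
                    r * h // block_rows, (r + 1) * h // block_rows,
                    c * w // block_cols, (c + 1) * w // block_cols))
        for f in futures:
            f.result()
    return out
```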
As an example, the data storage formats of the first feature map and the second feature map are as follows:
In some optional implementations of the present embodiment, referring to fig. 5, in a case where the first feature map includes the first edge image, the first segmentation mask image, and the first contour image, the first edge image, the first segmentation mask image, and the first contour image may be processed based on the optical flow information to obtain the second edge image, the second segmentation mask image, and the second contour image of the adjacent frame. The second edge image, the second segmentation mask image and the second contour image are all second feature images.
In some optional implementations of this embodiment, the first edge image, the first segmentation mask image, and the first contour image may be split respectively to obtain a plurality of first edge image blocks, a plurality of first segmentation mask image blocks, and a plurality of first contour image blocks. Then, the plurality of first edge image blocks, the plurality of first segmentation mask image blocks, and the plurality of first contour image blocks are processed respectively based on the optical flow information to obtain the second edge image, second segmentation mask image, and second contour image of the adjacent frame. Specifically, the three feature maps of the adjacent frames can be predicted according to the motion vector of each pixel, as shown in fig. 7. Reference numeral 701 shows the frames around the key frame t (the adjacent frame t-1, the key frame t, and the adjacent frame t+1). Pixels A, B and C represent the positions of the same pixel in frame t-1, frame t, and frame t+1. The optical flow prediction method above can predict the motion vector flow A→B of the pixel from position A in frame t-1 to position B in frame t, and the motion vector flow B→C from position B to position C.
For the first segmentation mask image, the segmentation region around pixel A may be moved to a new position according to the motion vector flow A→B, and the segmentation region around pixel B may be moved to a new position according to the motion vector flow B→C, as shown by reference numeral 702. For the first contour image, in the same way, the line-draft (contour) region covered by the pixels around A may be moved to a new position according to the motion vector flow A→B, and the line-draft region covered by the pixels around B may be moved to a new position according to the motion vector flow B→C, as shown by reference numeral 703. For the first edge image, in the same way, the edge region around pixel A may be moved to a new position according to the motion vector flow A→B, and the edge region around pixel B may be moved to a new position according to the motion vector flow B→C, as shown by reference numeral 704. Compared with directly registering the original picture using the optical flow information, the pixel RGB color values in these three feature maps are simpler, so the accuracy of frame prediction is higher.
In a substep S14, a second image corresponding to the adjacent frame is generated based on the first image and the second feature map.
Here, the RGB color values of the first image may be filled into the second feature map to form the second image corresponding to the adjacent frame. That is, the original RGB pixel values of the key frame's rendering result are filled into the pixel positions in the second feature map according to the correspondence, recorded in the table, between the pixel positions in the first feature map and the pixel positions in the second feature map, thereby obtaining the second image.
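For illustration only, the sketch below carries the colors of the first image (the rendered key frame) over to the adjacent frame according to the per-pixel flow correspondence, restricted to positions covered by a second feature map; the masking strategy and function names are assumptions.
```python
# Illustrative sketch only: fill the rendered key frame's colors into the adjacent
# frame using the optical-flow correspondence and a second feature map as a mask.
import numpy as np

def colorize_adjacent_frame(first_image, flow, second_mask):
    """first_image: H x W x 3 rendered key frame; flow: H x W x 2; second_mask: H x W."""
    h, w = first_image.shape[:2]
    second_image = np.zeros_like(first_image)
    for y in range(h):
        for x in range(w):
            u, v = flow[y, x]
            nx, ny = int(round(x + u)), int(round(y + v))
            if 0 <= nx < w and 0 <= ny < h and second_mask[ny, nx]:
                second_image[ny, nx] = first_image[y, x]   # carry the style color over
    return second_image
```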
In some optional implementations of the present embodiment, referring to fig. 5, in the case where the second feature map includes the second edge image, the second segmentation mask image, and the second contour image, the second image corresponding to the adjacent frame may be generated based on the second edge image, the second segmentation mask image, the second contour image, and the first image. The second segmentation mask picture ensures the color of the whole object, the second contour image ensures the contour color of the object, and the second edge image ensures the edge detail color of the object. Compared with the traditional scheme of inputting the whole video into a model, the embodiment of the application performs style rendering on the adjacent frames of the key frames through optical flow information, which can better reduce video jitter, improves the overall rendering result of the video, and ensures the fluency and stability of video style rendering.
Step 104, generating a second video based on the first image and the second image.
In this embodiment, the first images and the second images generated as described above may be combined into frames using an ffmpeg command to generate the second video. In addition, the second video may be generated by performing frame combination on the first images and the second images based on parameters of the first video. The parameters may include, but are not limited to, at least one of: video codec, number of video frames, frame rate, video display aspect ratio, and the like. If the original video has audio, the audio may also be incorporated into the second video.
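For illustration only, one possible ffmpeg invocation for re-assembling the rendered frames into the second video, reusing the original frame rate and audio track, is sketched below; the file names, frame pattern, and frame rate are assumptions.
```python
# Illustrative sketch only: combine rendered frames into the second video with ffmpeg
# and copy the audio track from the first video (file names and fps are assumptions).
import subprocess

def frames_to_video(frame_pattern="out/frame_%05d.png",
                    audio_source="first_video.mp4",
                    output="second_video.mp4", fps=30):
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,   # rendered frames
        "-i", audio_source,                            # take audio from the original video
        "-map", "0:v:0", "-map", "1:a:0?",             # video from frames, audio if present
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "copy", "-shortest",
        output,
    ], check=True)
```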
In some alternative implementations of the present embodiment, since the adjacent frames of each key frame are the N-1 frames before and the N-1 frames after that key frame, an adjacent frame may be processed twice. Therefore, when an adjacent frame has at least two corresponding second images, the at least two second images corresponding to the same adjacent frame can be fused to obtain a third image, and the second video is then generated based on the first image and the third image. In this way, video jitter can be further reduced, the overall rendering result of the video is improved, and the fluency and stability of video style rendering are ensured. See the following formula:
img_rslt = α × img_f_ij + β × img_e_jk
where α and β denote the fusion weights (each may be set to 0.5, for example), img_f_ij denotes the style rendering result diffused forward from the i-th frame to the j-th frame, and img_e_jk denotes the style rendering result diffused backward from the k-th frame to the j-th frame.
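For illustration only, the weighted fusion in the formula above could be realized as follows, assuming the two second images are aligned arrays of the same size; the function name is an assumption.
```python
# Illustrative sketch only: fuse the forward- and backward-diffused results for the
# same adjacent frame with the weights alpha and beta from the formula above.
import cv2

def fuse_bidirectional(img_forward, img_backward, alpha=0.5, beta=0.5):
    # img_rslt = alpha * img_f + beta * img_e
    return cv2.addWeighted(img_forward, alpha, img_backward, beta, 0)
```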
In some application scenarios, referring to fig. 8, the embodiment of the present application may be applied to a terminal device. The first video may be a real shot video, a video stored in an album, a video stored in a cloud, and the like. The method comprises the steps of extracting a key frame from a first video, performing style rendering on the key frame, diffusing a style rendering result of the key frame into adjacent frames, and finally fusing the rendering results to generate a second video. The video jitter phenomenon of style rendering can be reduced, the overall rendering result of the video is improved, and the fluency and stability of video style rendering are ensured.
In some application scenarios, the embodiments of the present application may be applied to VR glasses. Referring to fig. 9, the first video may be a video actually shot by the VR glasses, and the output second video may be a game-style video. VR glasses are an application product of virtual reality technology and mainly include: a display screen for displaying virtual reality images; optical lenses for projecting the image on the display screen into the eyes of the user to form a realistic virtual reality scene; sensors, including a gyroscope, an accelerometer, a magnetometer, and the like, for tracking the head movement and posture of the user so as to adjust the display of the virtual reality scene; a processor; a battery, so that the user can use the glasses without being limited by a power supply; and an audio device, such as headphones or speakers, for providing a realistic sound experience. In current games for VR glasses, it is necessary to demarcate a safe area, and the game can only be experienced within that area. Other VR games require the surrounding environment to be scanned in advance using the glasses, and the game then generates a corresponding VR scene according to the detected surroundings. In these applications, the surrounding environment needs to be identified in advance. The embodiment of the application can simplify complex operations such as environment scanning and safe-area demarcation for VR glasses, so that the VR glasses can generate a specified game scene in real time in combination with the actual scene.
In some application scenarios, the embodiments of the present application may be applied to online meeting scenarios. Referring to fig. 10, the first video may be the real conference picture video, and the output second video may be a conference picture video with a unified conference background. An online conference connects remote participants to the same meeting through internet technology for virtual communication and collaboration. Online conferences have become an indispensable part of work, study, social interaction, and other fields; they support remote office work, remote education, telemedicine, remote socializing, and remote sales, promote remote collaboration and communication, maintain social distance, promote digital transformation, and expand markets and customer bases. However, in an online conference, the surrounding environment of each participant is different, so the backgrounds differ when the video is accessed, which on the one hand makes the virtual conference appear informal and on the other hand does not protect the privacy of the participants. Based on the scheme of the present application, scene background switching for online conferences can be supported and the background styles of all participants can be modified uniformly, which protects the privacy of the participating users, improves the formality of the online conference, and enhances user stickiness of the product.
According to the method provided by the embodiment of the application, at least one key frame extracted from the first video is subjected to style rendering through the image style rendering model to obtain a first image corresponding to the key frame, then adjacent frames of the key frame in the first video are subjected to style diffusion based on the key frame and the first image to obtain a second image corresponding to the adjacent frames, and finally the second video is generated based on the first image and the second image, so that the second video after style conversion of each frame in the first video can be obtained. Because only the style rendering model is used for performing style rendering on the key frames of the first video, the problem that the frame-by-frame style rendering consumes too much time is solved, and the video style conversion efficiency is improved. Because the style rendering result of the adjacent frame is obtained based on the style rendering result of the key frame, the fluency of video style rendering is improved.
It should be noted that, in the video generating method provided by the embodiment of the present application, the execution subject may be a video generating device. In the embodiment of the present application, a method for executing video generation by a video generation device is taken as an example, and the video generation device provided in the embodiment of the present application is described.
As shown in fig. 11, the video generating apparatus 1100 of the present embodiment includes: an extracting unit 1101 for extracting at least one key frame of the first video; the style rendering unit 1102 is configured to perform style rendering on the key frame through an image style rendering model to obtain a first image corresponding to the key frame; a style diffusion unit 1103, configured to perform style diffusion on adjacent frames of the key frames in the first video based on the style of the first image, so as to obtain a second image corresponding to the adjacent frames; a generating unit 1104 for generating a second video based on the first image and the second image.
In some optional implementations of this embodiment, the at least one key frame may include a first key frame and a second key frame, and rendering styles of the first key frame and the second key frame may be different. In practice, the style rendering model can output images with style diversity by adjusting the input parameters of the style rendering model. Thus, the richness of the video style can be improved.
In some optional implementations of this embodiment, the style diffusing unit 1103 is further configured to: acquiring optical flow information between the key frame and adjacent frames of the key frame; extracting a first feature map of the key frame; generating a second feature map of the adjacent frame based on the optical flow information and the first feature map; and generating a second image corresponding to the adjacent frame based on the first image and the second feature map. Therefore, all frames in the first video can be guaranteed to be subjected to style rendering, and all frames are not required to be input into an image style rendering model, so that the video style rendering efficiency is improved.
In some optional implementations of this embodiment, the style diffusing unit 1103 is further configured to: inputting the key frame into an edge detection model, an image segmentation model and a contour detection model to obtain a first edge image, a first segmentation mask image and a first contour image corresponding to the key frame; and processing the first edge image, the first segmentation mask image and the first contour image based on the optical flow information to obtain a second edge image, a second segmentation mask image and a second contour image of the adjacent frame. Compared with the traditional scheme of inputting the video into the model, the embodiment of the application carries out style rendering on the adjacent frames of the key frames through the optical flow information, can better reduce video jitter, improves the overall rendering result of the overall video and ensures the fluency and stability of video style rendering.
In some optional implementations of this embodiment, the style diffusing unit 1103 is further configured to: dividing the first edge image, the first segmentation mask image and the first contour image respectively to obtain a plurality of first edge image blocks, a plurality of first segmentation mask image blocks and a plurality of first contour image blocks; and processing the plurality of first edge image blocks, the plurality of first segmentation mask image blocks and the plurality of first contour image blocks based on the optical flow information respectively to obtain a second edge image, a second segmentation mask image and a second contour image of the adjacent frame. Compared with the traditional scheme of inputting the video into the model, the embodiment of the application carries out style rendering on the adjacent frames of the key frames through the optical flow information, can better reduce video jitter, improves the overall rendering result of the overall video and ensures the fluency and stability of video style rendering.
In some optional implementations of the present embodiment, the style rendering unit 1102 is further configured to: acquiring description information; and inputting the description information and the key frame into an image style rendering model to obtain a first image corresponding to the key frame. By adding the description information, the rendering pictures in different styles can be generated, so that the image style rendering model has stronger generalization capability.
In some optional implementations of the present embodiment, the generating unit is further configured to: in the case where an adjacent frame has two corresponding second images, fuse the two second images corresponding to the same adjacent frame to obtain a third image; and generate the second video based on the first image and the third image. In this way, video jitter can be further reduced, the overall rendering result of the video is improved, and the fluency and stability of video style rendering are ensured.
In some optional implementations of the present embodiment, the extracting unit 1101 is further configured to: determining a sampling interval of a key frame based on an input frame rate, an output frame rate and preset parameters of the first video; and extracting the key frames of the first video according to the sampling interval. By extracting the key frames of the first video at fixed sampling intervals and performing style rendering, the problem that frame-by-frame style rendering is too long in time can be solved, the video style rendering processing time is shortened, and the video style rendering efficiency is improved.
According to the device provided by the embodiment of the application, at least one key frame extracted from the first video is subjected to style rendering through the image style rendering model to obtain the first image corresponding to the key frame, then the adjacent frames of the key frame in the first video are subjected to style diffusion based on the key frame and the first image to obtain the second image corresponding to the adjacent frames, and finally the second video is generated based on the first image and the second image, so that the second video after style conversion of each frame in the first video can be obtained. Because only the style rendering model is used for performing style rendering on the key frames of the first video, the problem that the frame-by-frame style rendering consumes too much time is solved, and the video style conversion efficiency is improved. Because the style rendering result of the adjacent frame is obtained based on the style rendering result of the key frame, the fluency of video style rendering is improved.
The video generating device in the embodiment of the application may be an electronic device or a component in the electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. The electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a mobile internet device (Mobile Internet Device, MID), an augmented reality (Augmented Reality, AR)/virtual reality (Virtual Reality, VR) device, a robot, a wearable device, an ultra-mobile personal computer (Ultra-Mobile Personal Computer, UMPC), a netbook or a personal digital assistant (Personal Digital Assistant, PDA), and may also be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (Personal Computer, PC), a television (TV), a teller machine, a self-service machine, or the like, which is not particularly limited in the embodiments of the present application.
The video generating apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, and the embodiment of the present application is not limited specifically.
The video generating apparatus provided in the embodiment of the present application can implement each process implemented by the method embodiment of fig. 1, and in order to avoid repetition, details are not repeated here.
Optionally, as shown in fig. 12, the embodiment of the present application further provides an electronic device 1200, including a processor 1201 and a memory 1202, where the memory 1202 stores a program or an instruction that can be executed on the processor 1201, and the program or the instruction implements the steps of the embodiment of the video generating method when executed by the processor 1201, and can achieve the same technical effects, so that repetition is avoided, and no further description is given here.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 13 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1300 includes, but is not limited to: a radio frequency unit 1301, a network module 1302, an audio output unit 1303, an input unit 1304, a sensor 1305, a display unit 1306, a user input unit 1307, an interface unit 1308, a memory 1309, and a processor 1310.
Those skilled in the art will appreciate that the electronic device 1300 may also include a power source (e.g., a battery) for powering the various components, which may be logically connected to the processor 1310 by a power management system, such as to perform functions such as managing charging, discharging, and power consumption by the power management system. The electronic device structure shown in fig. 13 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.
Wherein the processor 1310 is configured to extract at least one key frame of the first video; performing style rendering on the key frames through an image style rendering model to obtain first images corresponding to the key frames; based on the style of the first image, performing style diffusion on adjacent frames of the key frames in the first video to obtain a second image corresponding to the adjacent frames; a second video is generated based on the first image and the second image.
And performing style rendering on at least one key frame extracted from the first video through an image style rendering model to obtain a first image corresponding to the key frame, performing style diffusion on adjacent frames of the key frame in the first video based on the key frame and the first image to obtain a second image corresponding to the adjacent frame, and finally generating the second video based on the first image and the second image, so that a second video after style conversion on each frame in the first video can be obtained. Because only the style rendering model is used for performing style rendering on the key frames of the first video, the problem that the frame-by-frame style rendering consumes too much time is solved, and the video style conversion efficiency is improved. Because the style rendering result of the adjacent frame is obtained based on the style rendering result of the key frame, the fluency of video style rendering is improved.
Alternatively, the at least one key frame may include a first key frame and a second key frame, and rendering styles of the first key frame and the second key frame may be different. In practice, the style rendering model can output images with style diversity by adjusting the input parameters of the style rendering model. Thus, the richness of the video style can be improved.
Optionally, the processor 1310 is further configured to: acquiring optical flow information between the key frame and adjacent frames of the key frame; extracting a first feature map of the key frame; generating a second feature map of the adjacent frame based on the optical flow information and the first feature map; and generating a second image corresponding to the adjacent frame based on the first image and the second feature map. Therefore, all frames in the first video can be guaranteed to be subjected to style rendering, and all frames are not required to be input into an image style rendering model, so that the video style rendering efficiency is improved.
Optionally, the processor 1310 is further configured to: inputting the key frame into an edge detection model, an image segmentation model and a contour detection model to obtain a first edge image, a first segmentation mask image and a first contour image corresponding to the key frame; and processing the first edge image, the first segmentation mask image and the first contour image based on the optical flow information to obtain a second edge image, a second segmentation mask image and a second contour image of the adjacent frame. Compared with the traditional scheme of inputting the video into the model, the embodiment of the application carries out style rendering on the adjacent frames of the key frames through the optical flow information, can better reduce video jitter, improves the overall rendering result of the overall video and ensures the fluency and stability of video style rendering.
Optionally, the processor 1310 is further configured to: dividing the first edge image, the first segmentation mask image and the first contour image respectively to obtain a plurality of first edge image blocks, a plurality of first segmentation mask image blocks and a plurality of first contour image blocks; and processing the plurality of first edge image blocks, the plurality of first segmentation mask image blocks and the plurality of first contour image blocks based on the optical flow information respectively to obtain a second edge image, a second segmentation mask image and a second contour image of the adjacent frame. The second segmentation mask picture ensures the color of the whole object, the second contour image ensures the contour color of the whole object, and the second edge image ensures the edge detail color of the whole object. Compared with the traditional scheme of inputting the video into the model, the embodiment of the application carries out style rendering on the adjacent frames of the key frames through the optical flow information, can better reduce video jitter, improves the overall rendering result of the overall video and ensures the fluency and stability of video style rendering.
Optionally, the processor 1310 is further configured to: acquiring description information; and inputting the description information and the key frame into an image style rendering model to obtain a first image corresponding to the key frame. By adding the description information, the rendering pictures in different styles can be generated, so that the image style rendering model has stronger generalization capability.
Optionally, the processor 1310 is further configured to: in the case where an adjacent frame has two corresponding second images, fuse the two second images corresponding to the same adjacent frame to obtain a third image; and generate the second video based on the first image and the third image. In this way, video jitter can be further reduced, the overall rendering result of the video is improved, and the fluency and stability of video style rendering are ensured.
Optionally, the processor 1310 is further configured to: determining a sampling interval of a key frame based on an input frame rate, an output frame rate and preset parameters of the first video; and extracting the key frames of the first video according to the sampling interval. By extracting the key frames of the first video at fixed sampling intervals and performing style rendering, the problem that frame-by-frame style rendering is too long in time can be solved, the video style rendering processing time is shortened, and the video style rendering efficiency is improved.
It should be appreciated that, in embodiments of the present application, the input unit 1304 may include a graphics processor (Graphics Processing Unit, GPU) 13041 and a microphone 13042; the graphics processor 13041 processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The display unit 1306 may include a display panel 13061, and the display panel 13061 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 1307 includes at least one of a touch panel 13071 and other input devices 13072. The touch panel 13071 is also referred to as a touch screen and may include two parts: a touch detection device and a touch controller. Other input devices 13072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
The memory 1309 may be used to store software programs as well as various data. The memory 1309 may mainly include a first storage area storing programs or instructions and a second storage area storing data, wherein the first storage area may store an operating system, application programs or instructions required for at least one function (such as a sound playing function or an image playing function), and the like. Further, the memory 1309 may include volatile memory or nonvolatile memory, or the memory 1309 may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (Programmable ROM, PROM), an erasable PROM (Erasable PROM, EPROM), an electrically erasable PROM (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), a static random access memory (Static RAM, SRAM), a dynamic random access memory (Dynamic RAM, DRAM), a synchronous dynamic random access memory (Synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), a synchronous link dynamic random access memory (Synchlink DRAM, SLDRAM), or a direct Rambus random access memory (Direct Rambus RAM, DRRAM). The memory 1309 in the embodiments of the application includes, but is not limited to, these and any other suitable types of memory.
The processor 1310 may include one or more processing units; optionally, processor 1310 integrates an application processor that primarily handles operations related to the operating system, user interface, and applications, and a modem processor that primarily handles wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 1310.
An embodiment of the application further provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the processes of the above video generation method embodiment and can achieve the same technical effects; to avoid repetition, details are not described here again.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
An embodiment of the application further provides a chip, including a processor and a communication interface coupled to the processor, where the processor is configured to run programs or instructions to implement the processes of the above video generation method embodiment and can achieve the same technical effects; to avoid repetition, details are not described here again.
It should be understood that the chip referred to in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip, etc.
An embodiment of the present application provides a computer program product stored in a storage medium. The program product is executed by at least one processor to implement the processes of the above video generation method embodiment and can achieve the same technical effects; to avoid repetition, details are not described here again.
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing the functions in the order shown or discussed; the functions may also be performed in a substantially simultaneous manner or in a reverse order depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted or combined. In addition, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and of course may also be implemented by hardware, although in many cases the former is the preferred implementation. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disk) and including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific embodiments, which are merely illustrative rather than restrictive. Enlightened by the present application, those of ordinary skill in the art may devise many other forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (18)

1. A method of video generation, the method comprising:
extracting at least one key frame of the first video;
performing style rendering on the key frames through an image style rendering model to obtain first images corresponding to the key frames;
based on the style of the first image, performing style diffusion on adjacent frames of the key frames in the first video to obtain a second image corresponding to the adjacent frames;
generating a second video based on the first image and the second image.
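To make the flow of the four claimed steps concrete, a compact sketch is given below; `style_render` and `style_diffuse` are placeholder callables, the fixed sampling interval and the naive averaging of two candidates (cf. claim 7) are assumptions, and frames after the last key frame are ignored for brevity:

```python
def generate_styled_video(frames, style_render, style_diffuse, interval=3):
    """Illustrative end-to-end flow: style key frames, diffuse the style to the
    frames between them, then assemble the second video in frame order."""
    key_ids = list(range(0, len(frames), interval))
    first = {k: style_render(frames[k]) for k in key_ids}      # first images
    out = dict(first)
    for a, b in zip(key_ids, key_ids[1:]):                     # consecutive key frames
        for j in range(a + 1, b):                              # frames between them
            from_a = style_diffuse(first[a], frames[a], frames[j])
            from_b = style_diffuse(first[b], frames[b], frames[j])
            out[j] = (from_a + from_b) / 2                     # second (fused) image
    return [out[i] for i in sorted(out)]                       # second video frames
```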
2. The method of claim 1, wherein the at least one key frame comprises a first key frame and a second key frame, wherein the first key frame and the second key frame differ in rendering style.
3. The method according to claim 1, wherein the performing style diffusion on adjacent frames of the key frame in the first video based on the style of the first image to obtain a second image corresponding to the adjacent frames comprises:
acquiring optical flow information between the key frame and adjacent frames of the key frame;
extracting a first feature map of the key frame;
generating a second feature map of the adjacent frame based on the optical flow information and the first feature map;
and generating a second image corresponding to the adjacent frame based on the first image and the second feature map.
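As an illustration of this flow-based diffusion step, a minimal sketch follows; the Farnebäck dense-flow estimator, the function name and the remap-based warping are assumptions made for illustration and are not prescribed by the claim:

```python
import cv2
import numpy as np

def warp_key_features(key_gray, adj_gray, key_feature_map):
    """Warp a key-frame feature map into the adjacent frame's coordinates.

    The flow is estimated from the adjacent frame toward the key frame, so each
    adjacent-frame pixel knows where to sample in the key-frame feature map.
    """
    flow = cv2.calcOpticalFlowFarneback(adj_gray, key_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # second feature map of the adjacent frame
    return cv2.remap(key_feature_map, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```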
4. A method according to claim 3, wherein said extracting a first feature map of the key frame comprises:
inputting the key frame into an edge detection model, an image segmentation model and a contour detection model to obtain a first edge image, a first segmentation mask image and a first contour image corresponding to the key frame; and
processing the first edge image, the first segmentation mask image and the first contour image based on the optical flow information to obtain a second edge image, a second segmentation mask image and a second contour image of the adjacent frame.
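To make the three feature maps concrete, the sketch below uses classical OpenCV operators (Canny, Otsu thresholding, findContours) as stand-ins for the edge detection, image segmentation and contour detection models, which the application does not name; the function name is likewise an assumption:

```python
import cv2
import numpy as np

def extract_first_feature_maps(key_frame_bgr):
    """Classical stand-ins for the (unnamed) edge detection, image segmentation
    and contour detection models of this claim."""
    gray = cv2.cvtColor(key_frame_bgr, cv2.COLOR_BGR2GRAY)
    edge_img = cv2.Canny(gray, 100, 200)                           # first edge image
    _, seg_mask = cv2.threshold(gray, 0, 255,
                                cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # first segmentation mask image
    contours, _ = cv2.findContours(seg_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contour_img = np.zeros_like(gray)
    cv2.drawContours(contour_img, contours, -1, 255, 1)            # first contour image
    return edge_img, seg_mask, contour_img
```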
5. The method of claim 4, wherein the processing the first edge image, the first segmentation mask image, and the first contour image based on the optical flow information to obtain a second edge image, a second segmentation mask image, and a second contour image of the neighboring frame comprises:
dividing the first edge image, the first segmentation mask image and the first contour image respectively to obtain a plurality of first edge image blocks, a plurality of first segmentation mask image blocks and a plurality of first contour image blocks;
and processing the plurality of first edge image blocks, the plurality of first segmentation mask image blocks and the plurality of first contour image blocks respectively based on the optical flow information to obtain a second edge image, a second segmentation mask image and a second contour image of the adjacent frame.
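A minimal sketch of such block-wise processing, assuming a fixed 64-pixel grid and the same remap-style warping used above (block size, function name and border handling are all assumptions):

```python
import cv2
import numpy as np

def warp_in_blocks(feature_img, flow, block=64):
    """Divide a feature image into blocks and warp each block with the matching
    crop of the dense optical-flow field (flow pointing from the adjacent frame
    to the key frame). Flow vectors that leave a block fall back to zeros."""
    h, w = feature_img.shape[:2]
    out = np.zeros_like(feature_img)
    for y in range(0, h, block):
        for x in range(0, w, block):
            patch = feature_img[y:y + block, x:x + block]
            f = flow[y:y + block, x:x + block]
            gy, gx = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
            map_x = (gx + f[..., 0]).astype(np.float32)
            map_y = (gy + f[..., 1]).astype(np.float32)
            out[y:y + block, x:x + block] = cv2.remap(
                patch, map_x, map_y, interpolation=cv2.INTER_LINEAR)
    return out
```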
6. The method according to claim 1, wherein performing style rendering on the key frame through an image style rendering model to obtain a first image corresponding to the key frame comprises:
acquiring description information;
and inputting the description information and the key frame into an image style rendering model to obtain the first image corresponding to the key frame.
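One plausible realization of an image style rendering model that accepts both a key frame and textual description information is a prompt-conditioned image-to-image diffusion pipeline; the sketch below uses the Hugging Face diffusers library purely as an example, and the model identifier, strength and guidance values are assumptions rather than anything specified by the claim:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Illustrative model choice only; any prompt-conditioned img2img model would fit.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

def render_key_frame(key_frame: Image.Image, description: str) -> Image.Image:
    """Feed the description information and the key frame to the model and
    return the styled key frame (the first image)."""
    return pipe(prompt=description, image=key_frame,
                strength=0.6, guidance_scale=7.5).images[0]

# e.g. styled = render_key_frame(Image.open("frame_000.png"), "oil painting style")
```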
7. The method of claim 1, wherein the generating a second video based on the first image and the second image comprises:
in a case where the adjacent frame has two corresponding second images, fusing the two second images corresponding to the same adjacent frame to obtain a third image; and
generating a second video based on the first image and the third image.
8. The method of claim 1, wherein extracting key frames of the first video comprises:
determining a sampling interval of the key frame based on the input frame rate, the output frame rate and preset parameters of the first video;
and extracting the key frames of the first video according to the sampling interval.
9. A video generating apparatus, the apparatus comprising:
an extracting unit, configured to extract at least one key frame of a first video;
a style rendering unit, configured to perform style rendering on the key frames through an image style rendering model to obtain first images corresponding to the key frames;
a style diffusion unit, configured to perform style diffusion on adjacent frames of the key frames in the first video based on the style of the first image to obtain a second image corresponding to the adjacent frames; and
a generation unit, configured to generate a second video based on the first image and the second image.
10. The apparatus of claim 9, wherein the at least one keyframe comprises a first keyframe and a second keyframe, wherein the first keyframe and the second keyframe differ in rendering style.
11. The apparatus of claim 9, wherein the style diffusion unit is further configured to:
acquire optical flow information between the key frame and adjacent frames of the key frame;
extract a first feature map of the key frame;
generate a second feature map of the adjacent frame based on the optical flow information and the first feature map; and
generate a second image corresponding to the adjacent frame based on the first image and the second feature map.
12. The apparatus of claim 11, wherein the style diffusion unit is further configured to:
input the key frame into an edge detection model, an image segmentation model and a contour detection model to obtain a first edge image, a first segmentation mask image and a first contour image corresponding to the key frame;
wherein the generating a second feature map of the adjacent frame based on the optical flow information and the first feature map comprises:
processing the first edge image, the first segmentation mask image and the first contour image based on the optical flow information to obtain a second edge image, a second segmentation mask image and a second contour image of the adjacent frame.
13. The apparatus of claim 12, wherein the style diffusion unit is further configured to:
divide the first edge image, the first segmentation mask image and the first contour image respectively to obtain a plurality of first edge image blocks, a plurality of first segmentation mask image blocks and a plurality of first contour image blocks; and
process the plurality of first edge image blocks, the plurality of first segmentation mask image blocks and the plurality of first contour image blocks respectively based on the optical flow information to obtain a second edge image, a second segmentation mask image and a second contour image of the adjacent frame.
14. The apparatus of claim 9, wherein the style rendering unit is further configured to:
acquire description information; and
input the description information and the key frame into the image style rendering model to obtain the first image corresponding to the key frame.
15. The apparatus of claim 9, wherein the generating unit is further configured to:
in a case where the adjacent frame has two corresponding second images, fuse the two second images corresponding to the same adjacent frame to obtain a third image; and
generate a second video based on the first image and the third image.
16. The apparatus of claim 9, wherein the extraction unit is further configured to:
determine a sampling interval for the key frames based on an input frame rate, an output frame rate and a preset parameter of the first video; and
extract the key frames of the first video according to the sampling interval.
17. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the video generation method of any of claims 1-8.
18. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the video generation method according to any of claims 1-8.
CN202410267683.0A 2024-03-08 2024-03-08 Video generation method and electronic equipment Pending CN118200463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410267683.0A CN118200463A (en) 2024-03-08 2024-03-08 Video generation method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410267683.0A CN118200463A (en) 2024-03-08 2024-03-08 Video generation method and electronic equipment

Publications (1)

Publication Number Publication Date
CN118200463A (en) 2024-06-14

Family

ID=91397507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410267683.0A Pending CN118200463A (en) 2024-03-08 2024-03-08 Video generation method and electronic equipment

Country Status (1)

Country Link
CN (1) CN118200463A (en)


Legal Events

Date Code Title Description
PB01 Publication