CN117830099A - Video super-resolution method, device, equipment and storage medium

Video super-resolution method, device, equipment and storage medium

Info

Publication number
CN117830099A
Authority
CN
China
Prior art keywords
video
resolution
super
module
model
Prior art date
Legal status
Pending
Application number
CN202311818992.4A
Other languages
Chinese (zh)
Inventor
姚霆
龙拂尘
邱钊凡
梅涛
Current Assignee
Beijing Zhixiang Future Technology Co ltd
Original Assignee
Beijing Zhixiang Future Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhixiang Future Technology Co ltd
Priority to CN202311818992.4A
Publication of CN117830099A
Legal status: Pending

Classifications

    • G06T 3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting, based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/047 Probabilistic or stochastic networks
    • G06T 3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Processing (AREA)

Abstract

The application provides a video super-resolution method, apparatus, device and storage medium. The method comprises: acquiring a video and Gaussian noise; and inputting the video and the Gaussian noise into a video super-resolution model to obtain the high-frequency resolution video output by the video super-resolution model. The video super-resolution model comprises a trained image generation model, an up-sampler, a space adaptation module, a time domain alignment module and a regulator; the time domain alignment module ensures that the inter-frame details of the high-frequency resolution video are consistent. With this method, a high-frequency resolution video with continuous inter-frame details is generated by the pre-trained video super-resolution model, and the generated video has rich, faithful details and smooth continuity.

Description

Video super-resolution method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a video super-resolution method, apparatus, device, and storage medium.
Background
The video super-resolution method can generate a corresponding high-resolution video from a given low-resolution video.
Existing video super-resolution methods fall into three categories: methods based on traditional interpolation models, methods based on deep-learning characterization models, and methods based on pre-trained generative models. The first category applies image-signal interpolation (e.g., linear or bicubic interpolation) to spatially up-sample each video frame and obtain a higher-resolution video. The second category uses a characterization model from deep learning (e.g., a recurrent neural network) and learns from large amounts of paired low-resolution and high-resolution video data to predict the high-resolution video corresponding to an input low-resolution video. The third category is conceptually similar to the second, but uses low-resolution video features to control a pre-trained generative model so that it generates high-resolution video consistent with the low-resolution content; compared with the second category, this approach yields richer detail in the result.
Existing video super-resolution methods can only generate high-resolution frames whose content approximately matches the low-resolution input; they cannot control the continuity between the generated video frames, so the generated results are not sufficiently realistic or consistent.
Disclosure of Invention
In order to solve one of the technical defects, the application provides a video super-resolution method, a device, equipment and a storage medium.
In a first aspect of the present application, a video super-resolution method is provided, which includes:
acquiring video and Gaussian noise;
inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model;
the video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module, a time domain alignment module and a regulator;
and the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Optionally, the image generation model is composed of a variational self-encoder and a denoising network;
the denoising network is used for denoising the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video;
the variational self-encoder comprises a variational encoder and a variational decoder;
a variational encoder for compressing the image data into hidden variable encoded data of a latent space;
and a variational decoder for recovering the hidden variable encoded data into image data.
After each cascade module of the denoising network and the variational decoder, a space adaptation module and a time domain alignment module are inserted;
the spatial adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation;
and the time domain alignment module is used for ensuring the continuity between frames.
Optionally, the spatial adaptation module is configured to predict an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, wherein i is the frame identification of the input video, and to perform an affine transformation on the feature map based on S_i and M_i.
Optionally, the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i;
wherein f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
Optionally, the time domain alignment module is configured to divide the video feature into three-dimensional sliding windows spanning multiple frames, and process the video feature in each sliding window based on a self-attention mechanism.
Optionally, when the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V;
wherein Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
Optionally, the adjuster is configured to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d;
wherein w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
In a second aspect of the present application, there is provided a video super-resolution apparatus, the apparatus comprising:
the acquisition module is used for acquiring video and Gaussian noise;
the processing module is used for inputting the video and Gaussian noise acquired by the acquisition module into the video super-resolution model to acquire a high-frequency resolution video output by the video super-resolution model;
the video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module and a time domain alignment module;
and the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
In a third aspect of the present application, there is provided an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method as described in the first aspect above.
In a fourth aspect of the present application, there is provided a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the method as described in the first aspect above.
The application provides a video super-resolution method, apparatus, device and storage medium. The method comprises: acquiring a video; and inputting the video into a pre-trained generation model to obtain the high-frequency resolution video output by the generation model, wherein the inter-frame details of the high-frequency resolution video are continuous. With this method, a high-frequency resolution video with continuous inter-frame details is generated by the pre-trained generation model, and the video has rich, faithful details and smooth continuity.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic flow chart of a video super-resolution method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a generative model provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a space adaptation module according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a time domain alignment module provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a video super-resolution device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of exemplary embodiments of the present application is given with reference to the accompanying drawings, and it is apparent that the described embodiments are only some of the embodiments of the present application and not exhaustive of all the embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
In the course of implementing the present application, the inventors found that existing video super-resolution methods can only generate high-resolution frames whose content approximately matches the low-resolution input; they cannot control the continuity between the generated video frames, so the generated results are not sufficiently realistic or consistent.
In view of the above problems, the embodiments of the present application provide a video super-resolution method, apparatus, device and storage medium. The method comprises: acquiring a video; and inputting the video into a pre-trained generation model to obtain the high-frequency resolution video output by the generation model, wherein the inter-frame details of the high-frequency resolution video are continuous. With this method, a high-frequency resolution video with continuous inter-frame details is generated by the pre-trained generation model, and the video has rich, faithful details and smooth continuity.
Referring to fig. 1, the present embodiment provides a video super-resolution method, which is implemented as follows:
101, acquiring video and Gaussian noise.
The video acquired in this step is a low resolution video.
102, inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model.
Wherein, the inter-frame details of the high-frequency resolution video have consistency.
For example, the low-resolution video X_L obtained in step 101 and the Gaussian noise signal N_0 are used as inputs to the video super-resolution model, and the video super-resolution model generates the corresponding high-resolution video X_H from X_L and N_0. X_H must not only restore details that conform to the low-resolution video content but also maintain the consistency of details between frames. X_H is the high-frequency resolution video obtained by the method provided in this embodiment.
As shown in fig. 2, the video super-resolution model includes: a trained image generation model, an upsampler, a spatial adaptation module, a temporal alignment module, and a regulator. And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
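For orientation, the following is a minimal PyTorch-style sketch of how these five components might be composed at inference time. It is an illustration under assumptions: the callables upsampler, unet, vae_decoder and adjuster, their guide argument and the fixed step count are hypothetical interfaces, not the actual structure of the model described in this application.

```python
def video_super_resolution(x_lr, noise, upsampler, unet, vae_decoder, adjuster, num_steps=50):
    """Hypothetical inference flow for the five-component model described above.

    x_lr:  low-resolution video X_L, shape (T, C, H, W)
    noise: Gaussian noise N_0 initialising the latent code
    """
    x_up = upsampler(x_lr)                    # temporal-attention + pixel-rearrangement upsampling
    latent = noise
    for t in reversed(range(num_steps)):      # multi-step denoising of the latent code
        latent = unet(latent, t, guide=x_up)  # SFA/TFA blocks inside the UNet take guidance features
    x_dec = vae_decoder(latent, guide=x_up)   # variational decoder, also guided by SFA/TFA blocks
    return adjuster(x_up, x_dec)              # color-deviation correction / content balancing
```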
1. Image generation model
The image generation model is composed of a variational self-encoder (VAE) and a denoising network (UNet).
The denoising network is used to denoise the hidden variable code of the video X_L to which the Gaussian noise N_0 has been added, so as to obtain the hidden variable code of the high-resolution video. After the denoising network executes the set multi-step denoising process, the hidden variable code of the high-resolution video is obtained. Trained on hundreds of millions of high-quality images, the variational self-encoder and the denoising network form a high-quality image generation model; their parameters are fixed in the subsequent process so that the learned knowledge of high-quality images is preserved.
The variational self-encoder includes a variational encoder and a variational decoder.
The variational encoder is used to compress image data into hidden variable encoded data of the latent space.
The variational decoder is used to recover the hidden variable encoded data into image data.
In order to restore details that conform to the low-resolution video content and to maintain the continuity of inter-frame details, a spatial adaptation module and a temporal alignment module may be inserted after each cascade module of the denoising network and the variational decoder, so that both the denoising process and the video reconstruction process are guided (as shown in fig. 1) and finer control is achieved.
In particular implementations, the image generation model may be a Diffusion model (Stable Diffusion) that can be trained on hundreds of millions of high quality pictures.
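To illustrate keeping the pretrained image knowledge fixed while training only the inserted modules, the sketch below freezes each cascade block and attaches a trainable spatial adaptation and temporal alignment pair after it. The wrapping scheme and the factory arguments are assumptions made for illustration, not the construction used in this application.

```python
import torch.nn as nn

def attach_guidance_modules(cascade_blocks, make_sfa, make_tfa):
    """Freeze pretrained cascade blocks and append a trainable SFA + TFA pair after each one."""
    wrapped = nn.ModuleList()
    for block in cascade_blocks:
        for p in block.parameters():
            p.requires_grad = False          # keep the learned high-quality image prior fixed
        wrapped.append(nn.ModuleDict({
            "block": block,                  # frozen UNet / decoder cascade module
            "sfa": make_sfa(),               # trainable spatial adaptation module
            "tfa": make_tfa(),               # trainable temporal alignment module
        }))
    return wrapped
```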
2. Up-sampler
Conventional image or video up-sampling typically uses bicubic or linear interpolation, which damages the original local structure of the video to some extent, and this uncertainty is amplified by the randomness of the generative model. Therefore, the up-sampler used in this embodiment is implemented with a temporal mutual attention mechanism and a pixel rearrangement mechanism (the Upscaler in fig. 1), so that a more accurate up-sampled video is obtained as input to the subsequent image generation model.
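A minimal sketch of such an upsampler is given below, assuming per-pixel temporal mutual attention across frames followed by pixel rearrangement (PixelShuffle); the channel width, head count and the 4x scale are illustrative assumptions, not parameters stated in this application.

```python
import torch.nn as nn

class TemporalPixelShuffleUpsampler(nn.Module):
    """Sketch: temporal mutual attention across frames, then pixel rearrangement."""
    def __init__(self, channels=64, scale=4, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, channels, 3, padding=1)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.expand = nn.Conv2d(channels, 3 * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, video):                    # video: (T, 3, H, W)
        feat = self.embed(video)                 # (T, C, H, W)
        t, c, h, w = feat.shape
        tokens = feat.permute(2, 3, 0, 1).reshape(h * w, t, c)   # each pixel attends across frames
        aligned, _ = self.temporal_attn(tokens, tokens, tokens)
        feat = aligned.reshape(h, w, t, c).permute(2, 3, 0, 1)
        return self.shuffle(self.expand(feat))   # (T, 3, H*scale, W*scale)
```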
3. Space adaptation module (Spatial Feature Adaption, SFA)
And the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation.
Specifically, the spatial adaptation module predicts an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, where i is the frame identification of the input video, and performs an affine transformation on the feature map based on S_i and M_i.
For example, the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i,
where f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
The spatial adaptation module guides the feature transformation of the diffusion model using features extracted from the input video. Fig. 3 shows the specific architecture of the spatial adaptation module: a two-dimensional convolution layer predicts, from each frame feature map g_i of the up-sampled video, an amplification coefficient S_i and a bias coefficient M_i; the normalized generated feature map is then affine-transformed with these two coefficients, formally defined as f̂_i = S_i · (f_i − μ_i) / σ_i + M_i.
The affine-transformed feature map thus integrates the information of the original video, realizing visual content control in the spatial domain; moreover, because the affine transformation coefficients are predicted for every pixel position, precise pixel-level regulation is facilitated.
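The affine modulation above can be sketched as follows, assuming that two 3x3 convolutions predict S_i and M_i from the guidance feature g_i and that the normalization uses the per-channel spatial mean and standard deviation of f_i; the layer sizes are assumptions for illustration.

```python
import torch.nn as nn

class SpatialFeatureAdaption(nn.Module):
    """Sketch of the per-pixel affine modulation f̂_i = S_i · (f_i − μ_i) / σ_i + M_i."""
    def __init__(self, guide_channels, feat_channels):
        super().__init__()
        self.to_scale = nn.Conv2d(guide_channels, feat_channels, 3, padding=1)
        self.to_bias = nn.Conv2d(guide_channels, feat_channels, 3, padding=1)

    def forward(self, f_i, g_i, eps=1e-6):
        s_i = self.to_scale(g_i)                        # amplification coefficient S_i
        m_i = self.to_bias(g_i)                         # bias coefficient M_i
        mu = f_i.mean(dim=(-2, -1), keepdim=True)       # per-channel mean of f_i
        sigma = f_i.std(dim=(-2, -1), keepdim=True)     # per-channel std of f_i
        return s_i * (f_i - mu) / (sigma + eps) + m_i   # affine transform of the normalized map
```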
4. Time domain alignment module (Temporal Feature Alignment, TFA)
And the time domain alignment module is used for ensuring the continuity between frames, and the structure of the time domain alignment module is shown in fig. 4.
And the time domain alignment module is used for dividing the video characteristic into three-dimensional sliding windows crossing multiple frames, and processing the video characteristic in each sliding window based on a self-attention mechanism.
When the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V,
where Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
In order to ensure continuity between frames, a temporal alignment module is connected after each spatial adaptation module. Fig. 4 shows the specific structure of the temporal alignment module. Specifically, the temporal alignment module divides the generated video features into three-dimensional sliding windows spanning multiple frames and then applies a self-attention mechanism within each sliding window, softmax(Q·K^T / √d) · V, where Q, K and V are the query (Query), key (Key) and value (Value) extracted from the generated video features within the sliding window, and d is the characteristic channel dimension of the key.
In a particular implementation, a cross-attention operation may also be cascaded after the self-attention operation to process the generated video features together with the original video features, e.g. softmax(Q·K^T / √d) · V, where Q is the query extracted from the generated video features within the sliding window, and K and V are the key and value extracted from the feature G_tub of the corresponding sliding window of the original video. The temporal alignment module uses the self-attention operation to let information be interactively fused across the frames of the generated video features, and uses the cross-attention operation to further correct the generated video features with the original video features, thereby aligning the generated video features in the temporal domain.
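The window-level attention described above can be sketched as follows, with the linear projections for the queries, keys and values omitted for brevity; the tensor layout (sliding windows flattened into a batch of token sequences) is an assumption made for illustration.

```python
import torch

def temporal_feature_alignment(gen_windows, src_windows):
    """Self-attention inside each 3D sliding window of the generated features, then
    cross-attention against the matching window G_tub of the original-video features.

    gen_windows, src_windows: (num_windows, tokens_per_window, d)
    """
    d = gen_windows.shape[-1]
    # self-attention: fuse information across the frames covered by each window
    attn = torch.softmax(gen_windows @ gen_windows.transpose(-2, -1) / d ** 0.5, dim=-1)
    fused = attn @ gen_windows
    # cross-attention: correct the generated features using the original-video window features
    attn = torch.softmax(fused @ src_windows.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ src_windows
```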
5. Regulator
The adjuster is used to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d,
where w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing (namely, the high-frequency resolution video finally output by the video super-resolution model), and φ is a mapping function composed of two-dimensional convolutions.
Because the images generated by the image generation model may exhibit color deviation, the adjuster corrects the color of the images output by the image generation model, i.e., X_H = w · φ(X_u) + (1 − w) · X_d, where w is a trade-off parameter. Through feature learning, the adjuster balances the original input video content and the synthesized video content, thereby achieving better results in image quality and color fidelity.
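A minimal sketch of the adjuster, assuming the blend X_H = w · φ(X_u) + (1 − w) · X_d given above; the two-layer convolutional mapping φ and the default value of w are illustrative assumptions.

```python
import torch.nn as nn

class Adjuster(nn.Module):
    """Sketch of color-deviation correction: blend a convolutional mapping of the upsampled
    video X_u with the decoded video X_d under a trade-off weight w."""
    def __init__(self, w=0.5, channels=3):
        super().__init__()
        self.w = w
        self.mapping = nn.Sequential(                 # the mapping function φ (2D convolutions)
            nn.Conv2d(channels, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x_u, x_d):                      # x_u: upsampled video, x_d: decoded video
        return self.w * self.mapping(x_u) + (1.0 - self.w) * x_d
```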
The video super-resolution method provided by this embodiment is realized by adapting an image diffusion model: a generation model pre-trained on large-scale high-definition image data is used to generate high-resolution video from low-resolution video, and the output content of the generation model is precisely controlled, so that the generated high-resolution video has rich, faithful details and smooth consistency.
The video super-resolution method provided by this embodiment can take a low-resolution video as input and generate the corresponding high-quality, high-frequency resolution video. In addition, the spatial adaptation module and the time domain alignment module adopted in this embodiment effectively ensure that the image content within each generated high-frequency resolution video frame is rich and faithful in detail, and that the image content is consistent between frames.
This embodiment provides a video super-resolution method that acquires a video and Gaussian noise; inputs the video and the Gaussian noise into a video super-resolution model; and obtains the high-frequency resolution video output by the video super-resolution model. The video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module, a time domain alignment module and a regulator; the time domain alignment module ensures that the inter-frame details of the high-frequency resolution video are consistent. With the method provided by this embodiment, a high-frequency resolution video with continuous inter-frame details is generated by the pre-trained video super-resolution model, and the video has rich, faithful details and smooth continuity.
Based on the same inventive concept as the video super-resolution method, this embodiment provides a video super-resolution device; referring to fig. 5, the device includes:
an acquisition module 501 is configured to acquire video and gaussian noise.
The processing module 502 is configured to input the video and gaussian noise acquired by the acquiring module 501 into the video super-resolution model, and acquire a high-frequency resolution video output by the video super-resolution model.
The image generation model is composed of a variational self-encoder and a denoising network.
The denoising network is used to denoise the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video.
The variational self-encoder includes a variational encoder and a variational decoder.
The variational encoder is used to compress image data into hidden variable encoded data of the latent space.
The variational decoder is used to recover the hidden variable encoded data into image data.
A space adaptation module and a time domain alignment module are inserted after each cascade module of the denoising network and the variational decoder.
And the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation.
And the time domain alignment module is used for ensuring the continuity between frames.
The space adaptation module predicts an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, where i is the frame identification of the input video, and performs an affine transformation on the feature map based on S_i and M_i.
The affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i,
where f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
The time domain alignment module is used for dividing the video features into three-dimensional sliding windows crossing multiple frames, and processing the video features in each sliding window based on a self-attention mechanism.
When the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V,
where Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
The regulator is used to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d,
where w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
With the device provided by this embodiment, a high-frequency resolution video with continuous inter-frame details is generated through the pre-trained generation model, and the video has rich, faithful details and smooth continuity.
Based on the same inventive concept of the video super-resolution method, this embodiment provides an electronic device, as shown in fig. 6, including: memory 601, processor 602, and computer programs.
Wherein a computer program is stored in the memory 601 and configured to be executed by the processor 602 to implement the video super resolution method described above.
In particular:
the video super-resolution model comprises: a trained image generation model, an upsampler, a spatial adaptation module, and a temporal alignment module.
And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Video and gaussian noise are acquired.
And inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model.
The video super-resolution model comprises: a trained image generation model, an upsampler, a spatial adaptation module, a temporal alignment module, and a regulator.
And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Optionally, the image generation model is composed of a variational self-encoder and a denoising network.
The denoising network is used to denoise the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video.
The variational self-encoder includes a variational encoder and a variational decoder.
The variational encoder is used to compress image data into hidden variable encoded data of the latent space.
The variational decoder is used to recover the hidden variable encoded data into image data.
A space adaptation module and a time domain alignment module are inserted after each cascade module of the denoising network and the variational decoder.
And the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation.
And the time domain alignment module is used for ensuring the continuity between frames.
Optionally, the spatial adaptation module predicts an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, where i is the frame identification of the input video, and performs an affine transformation on the feature map based on S_i and M_i.
Optionally, the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i,
where f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
Optionally, the time domain alignment module is configured to divide the video feature into three-dimensional sliding windows spanning multiple frames, and process the video feature in each sliding window based on a self-attention mechanism.
Optionally, when the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V,
where Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
Optionally, the adjuster is configured to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d,
where w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
With the electronic device provided in this embodiment, the computer program executed by the processor generates, by means of a pre-trained generation model, a high-frequency resolution video whose inter-frame details are consistent; the video has rich, faithful details and smooth consistency.
Based on the same inventive concept of the video super-resolution method, the present embodiment provides a computer-readable storage medium, and a computer program stored thereon. The computer program is executed by the processor to implement the video super-resolution method described above.
In particular:
the video super-resolution model comprises: a trained image generation model, an upsampler, a spatial adaptation module, and a temporal alignment module.
And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Video and gaussian noise are acquired.
And inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model.
The video super-resolution model comprises: a trained image generation model, an upsampler, a spatial adaptation module, a temporal alignment module, and a regulator.
And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Optionally, the image generation model is composed of a variational self-encoder and a denoising network.
The denoising network is used to denoise the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video.
The variational self-encoder includes a variational encoder and a variational decoder.
The variational encoder is used to compress image data into hidden variable encoded data of the latent space.
The variational decoder is used to recover the hidden variable encoded data into image data.
A space adaptation module and a time domain alignment module are inserted after each cascade module of the denoising network and the variational decoder.
And the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation.
And the time domain alignment module is used for ensuring the continuity between frames.
Optionally, the spatial adaptation module predicts an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, where i is the frame identification of the input video, and performs an affine transformation on the feature map based on S_i and M_i.
Optionally, the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i,
where f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
Optionally, the time domain alignment module is configured to divide the video feature into three-dimensional sliding windows spanning multiple frames, and process the video feature in each sliding window based on a self-attention mechanism.
Optionally, when the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V,
where Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
Optionally, the adjuster is configured to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d,
where w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
For the computer-readable storage medium provided in this embodiment, the computer program stored thereon is executed by a processor to generate, through a pre-trained generation model, a high-frequency resolution video with continuous inter-frame details; the video has rich, faithful details and smooth continuity.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the embodiments of the present application may be implemented in various computer languages, for example, the object-oriented programming language Java and the interpreted scripting language JavaScript, etc.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A video super-resolution method, the method comprising:
acquiring video and Gaussian noise;
inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model;
the video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module, a time domain alignment module and a regulator;
the time domain alignment module is used for ensuring continuity of inter-frame details of the high-frequency resolution video.
2. The method of claim 1, wherein the image generation model is comprised of a variational self-encoder and a denoising network;
the denoising network is used for denoising the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video;
the variational self-encoder comprises a variational encoder and a variational decoder;
the variational encoder is used for compressing the image data into hidden variable coded data of a latent space;
the variational decoder is used for recovering the hidden variable coded data into image data;
each cascade module of the denoising network and the variational decoder is inserted into the space adaptation module and the time domain alignment module;
the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation;
the time domain alignment module is used for guaranteeing the continuity between frames.
3. The method according to claim 2, wherein the spatial adaptation module is configured to predict an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, wherein i is the frame identification of the input video, and to perform an affine transformation on the feature map based on the S_i and M_i.
4. A method according to claim 3, characterized in that the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i;
wherein f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
5. The method of claim 2, wherein the temporal alignment module is configured to divide the video features into three-dimensional sliding windows across multiple frames, and wherein processing is performed within each sliding window based on a self-attention mechanism.
6. The method of claim 5, wherein the self-attention processing performed by the time domain alignment module is implemented using the formula F̂ = softmax(Q·K^T / √d) · V;
wherein Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
7. The method of claim 1, wherein the adjuster is configured to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d;
wherein w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
8. A video super-resolution apparatus, the apparatus comprising:
the acquisition module is used for acquiring video and Gaussian noise;
the processing module is used for inputting the video and Gaussian noise acquired by the acquisition module into a video super-resolution model to acquire a high-frequency resolution video output by the video super-resolution model;
the video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module and a time domain alignment module;
the time domain alignment module is used for ensuring continuity of inter-frame details of the high-frequency resolution video.
9. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program being executed by a processor to implement the method of any of claims 1-7.
CN202311818992.4A 2023-12-27 2023-12-27 Video super-resolution method, device, equipment and storage medium Pending CN117830099A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311818992.4A CN117830099A (en) 2023-12-27 2023-12-27 Video super-resolution method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311818992.4A CN117830099A (en) 2023-12-27 2023-12-27 Video super-resolution method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117830099A true CN117830099A (en) 2024-04-05

Family

ID=90516683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311818992.4A Pending CN117830099A (en) 2023-12-27 2023-12-27 Video super-resolution method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117830099A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469884A (en) * 2021-07-15 2021-10-01 长视科技股份有限公司 Video super-resolution method, system, equipment and storage medium based on data simulation
CN115496663A (en) * 2022-10-12 2022-12-20 南京信息工程大学 Video super-resolution reconstruction method based on D3D convolution intra-group fusion network


Similar Documents

Publication Publication Date Title
CN110969577B (en) Video super-resolution reconstruction method based on deep double attention network
CN111587447B (en) Frame-cycled video super-resolution
US20220261965A1 (en) Training method of image processing model, image processing method, apparatus, and device
CN111784570A (en) Video image super-resolution reconstruction method and device
CN105408935B (en) Up-sampling and signal enhancing
CN107231566A (en) A kind of video transcoding method, device and system
CN112529776A (en) Training method of image processing model, image processing method and device
CN116681584A (en) Multistage diffusion image super-resolution algorithm
CN115409716B (en) Video processing method, device, storage medium and equipment
López-Tapia et al. A single video super-resolution GAN for multiple downsampling operators based on pseudo-inverse image formation models
CN108475414B (en) Image processing method and device
Liu et al. Learning noise-decoupled affine models for extreme low-light image enhancement
Agrawal et al. Image resolution enhancement using lifting wavelet and stationary wavelet transform
CN106981046B (en) Single image super resolution ratio reconstruction method based on multi-gradient constrained regression
CN114494022A (en) Model training method, super-resolution reconstruction method, device, equipment and medium
CN113747242A (en) Image processing method, image processing device, electronic equipment and storage medium
Liu et al. Arbitrary-scale super-resolution via deep learning: A comprehensive survey
Alvarez-Ramos et al. Image super-resolution via two coupled dictionaries and sparse representation
Hong et al. Image interpolation using interpolative classified vector quantization
CN116957964A (en) Small sample image generation method and system based on diffusion model
JP5514132B2 (en) Image reduction device, image enlargement device, and program thereof
CN117830099A (en) Video super-resolution method, device, equipment and storage medium
Peng Super-resolution reconstruction using multiconnection deep residual network combined an improved loss function for single-frame image
CN106447610B (en) Image rebuilding method and device
Hui et al. Rate-adaptive neural network for image compressive sensing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination