CN117830099A - Video super-resolution method, device, equipment and storage medium

Video super-resolution method, device, equipment and storage medium

Info

Publication number
CN117830099A
Authority
CN
China
Prior art keywords
video
resolution
super
module
model
Prior art date
Legal status
Pending
Application number
CN202311818992.4A
Other languages
Chinese (zh)
Inventor
姚霆
龙拂尘
邱钊凡
梅涛
Current Assignee
Beijing Zhixiang Future Technology Co ltd
Original Assignee
Beijing Zhixiang Future Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhixiang Future Technology Co ltd
Priority to CN202311818992.4A
Publication of CN117830099A
Legal status: Pending

Classifications

    • G06T 3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting, based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/047 Probabilistic or stochastic networks
    • G06T 3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Processing (AREA)

Abstract

The application provides a video super-resolution method, apparatus, device and storage medium. The method comprises: acquiring a video and Gaussian noise; and inputting the video and the Gaussian noise into a video super-resolution model to obtain the high-frequency resolution video output by the video super-resolution model. The video super-resolution model comprises a trained image generation model, an up-sampler, a space adaptation module, a time domain alignment module and a regulator; the time domain alignment module ensures that the inter-frame details of the high-frequency resolution video are consistent. With this method, a high-frequency resolution video with continuous inter-frame details is generated by the pre-trained video super-resolution model, and the generated video has rich, faithful details and smooth continuity.

Description

Video super-resolution method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a video super-resolution method, apparatus, device, and storage medium.
Background
The video super-resolution method can generate a corresponding high-resolution video from a given low-resolution video.
Existing video super-resolution methods fall into three categories: methods based on traditional interpolation models, methods based on deep-learning characterization models, and methods based on pre-trained generative models. The first category applies image-signal interpolation (e.g., linear or bicubic interpolation) to spatially up-sample each video frame and obtain a higher-resolution video. The second category uses a characterization model from deep learning (e.g., a recurrent neural network) and learns from large amounts of paired low-resolution and high-resolution video data to predict the high-resolution video corresponding to an input low-resolution video. The third category is conceptually similar to the second, but uses low-resolution video features to control a pre-trained generative model so that it generates high-resolution video consistent with the low-resolution content; compared with the second category, this approach yields richer detail in the result.
Existing video super-resolution methods can only generate high-resolution frames whose content approximately matches the low-resolution input; they cannot control the continuity between the generated video frames, so the generated results are not sufficiently realistic or consistent.
Disclosure of Invention
In order to solve one of the technical defects, the application provides a video super-resolution method, a device, equipment and a storage medium.
In a first aspect of the present application, a video super-resolution method is provided, which includes:
acquiring video and Gaussian noise;
inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model;
the video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module, a time domain alignment module and a regulator;
and the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Optionally, the image generation model is composed of a variational self-encoder and a denoising network;
the denoising network is used for denoising the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video;
the variational self-encoder comprises a variational encoder and a variational decoder;
a variational encoder for compressing the image data into hidden variable encoded data of a latent space;
and a variational decoder for recovering the hidden variable encoded data into image data.
After each cascade module of the denoising network and the variational decoder, a space adaptation module and a time domain alignment module are inserted;
the spatial adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation;
and the time domain alignment module is used for ensuring the continuity between frames.
Optionally, the spatial adaptation module is configured to predict an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, wherein i is the frame identification of the input video, and to perform an affine transformation on the feature map based on S_i and M_i.
Optionally, the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i;
wherein f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
Optionally, the time domain alignment module is configured to divide the video feature into three-dimensional sliding windows spanning multiple frames, and process the video feature in each sliding window based on a self-attention mechanism.
Optionally, when the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V;
wherein Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
Optionally, the adjuster is configured to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d;
wherein w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
In a second aspect of the present application, there is provided a video super-resolution apparatus, the apparatus comprising:
the acquisition module is used for acquiring video and Gaussian noise;
the processing module is used for inputting the video and Gaussian noise acquired by the acquisition module into the video super-resolution model to acquire a high-frequency resolution video output by the video super-resolution model;
the video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module and a time domain alignment module;
and the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
In a third aspect of the present application, there is provided an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method as described in the first aspect above.
In a fourth aspect of the present application, there is provided a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the method as described in the first aspect above.
The application provides a video super-resolution method, apparatus, device and storage medium. The method comprises: acquiring a video; and inputting the video into a pre-trained generation model to obtain the high-frequency resolution video output by the generation model, wherein the inter-frame details of the high-frequency resolution video are continuous. With this method, a high-frequency resolution video with continuous inter-frame details is generated by the pre-trained generation model, and the video has rich, faithful details and smooth continuity.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic flow chart of a video super-resolution method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a generative model provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a space adaptation module according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a time domain alignment module provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a video super-resolution device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of exemplary embodiments of the present application is given with reference to the accompanying drawings, and it is apparent that the described embodiments are only some of the embodiments of the present application and not exhaustive of all the embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
In the course of implementing the present application, the inventors found that existing video super-resolution methods can only generate high-resolution frames whose content approximately matches the low-resolution input; they cannot control the continuity between the generated video frames, so the generated results are not sufficiently realistic or consistent.
In view of the above problems, the embodiments of the present application provide a video super-resolution method, apparatus, device and storage medium. The method comprises: acquiring a video; and inputting the video into a pre-trained generation model to obtain the high-frequency resolution video output by the generation model, wherein the inter-frame details of the high-frequency resolution video are continuous. With this method, a high-frequency resolution video with continuous inter-frame details is generated by the pre-trained generation model, and the video has rich, faithful details and smooth continuity.
Referring to fig. 1, the present embodiment provides a video super-resolution method, which is implemented as follows:
101, acquiring video and Gaussian noise.
The video acquired in this step is a low resolution video.
102, inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model.
Wherein, the inter-frame details of the high-frequency resolution video have consistency.
For example, the low-resolution video X_L obtained in step 101 and the Gaussian noise signal N_0 are used as inputs to the video super-resolution model, and the video super-resolution model generates the corresponding high-resolution video X_H from X_L and N_0. X_H must not only restore details that conform to the low-resolution video content but also maintain the consistency of details between frames. X_H is the high-frequency resolution video obtained by the method provided in this embodiment.
As shown in fig. 2, the video super-resolution model includes: a trained image generation model, an upsampler, a spatial adaptation module, a temporal alignment module, and a regulator. And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
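For orientation, the following is a minimal PyTorch-style sketch of how these five components might be composed at inference time. It is an illustration under assumptions: the callables upsampler, unet, vae_decoder and adjuster, their guide argument and the fixed step count are hypothetical interfaces, not the actual structure of the model described in this application.

```python
def video_super_resolution(x_lr, noise, upsampler, unet, vae_decoder, adjuster, num_steps=50):
    """Hypothetical inference flow for the five-component model described above.

    x_lr:  low-resolution video X_L, shape (T, C, H, W)
    noise: Gaussian noise N_0 initialising the latent code
    """
    x_up = upsampler(x_lr)                    # temporal-attention + pixel-rearrangement upsampling
    latent = noise
    for t in reversed(range(num_steps)):      # multi-step denoising of the latent code
        latent = unet(latent, t, guide=x_up)  # SFA/TFA blocks inside the UNet take guidance features
    x_dec = vae_decoder(latent, guide=x_up)   # variational decoder, also guided by SFA/TFA blocks
    return adjuster(x_up, x_dec)              # color-deviation correction / content balancing
```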
1. Image generation model
The image generation model is composed of a variational self-encoder (VAE) and a denoising network (UNet).
The denoising network is used to denoise the hidden variable code of the video X_L to which the Gaussian noise N_0 has been added, so as to obtain the hidden variable code of the high-resolution video. After the denoising network executes the set multi-step denoising process, the hidden variable code of the high-resolution video is obtained. Trained on hundreds of millions of high-quality images, the variational self-encoder and the denoising network form a high-quality image generation model; their parameters are fixed in the subsequent process so that the learned knowledge of high-quality images is preserved.
The variational self-encoder includes a variational encoder and a variational decoder.
The variational encoder is used to compress image data into hidden variable encoded data of the latent space.
The variational decoder is used to recover the hidden variable encoded data into image data.
In order to restore details that conform to the low-resolution video content and to maintain the continuity of inter-frame details, a spatial adaptation module and a temporal alignment module may be inserted after each cascade module of the denoising network and the variational decoder, so that both the denoising process and the video reconstruction process are guided (as shown in fig. 1) and finer control is achieved.
In particular implementations, the image generation model may be a Diffusion model (Stable Diffusion) that can be trained on hundreds of millions of high quality pictures.
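To illustrate keeping the pretrained image knowledge fixed while training only the inserted modules, the sketch below freezes each cascade block and attaches a trainable spatial adaptation and temporal alignment pair after it. The wrapping scheme and the factory arguments are assumptions made for illustration, not the construction used in this application.

```python
import torch.nn as nn

def attach_guidance_modules(cascade_blocks, make_sfa, make_tfa):
    """Freeze pretrained cascade blocks and append a trainable SFA + TFA pair after each one."""
    wrapped = nn.ModuleList()
    for block in cascade_blocks:
        for p in block.parameters():
            p.requires_grad = False          # keep the learned high-quality image prior fixed
        wrapped.append(nn.ModuleDict({
            "block": block,                  # frozen UNet / decoder cascade module
            "sfa": make_sfa(),               # trainable spatial adaptation module
            "tfa": make_tfa(),               # trainable temporal alignment module
        }))
    return wrapped
```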
2. Up-sampler
Conventional image or video up-sampling typically uses bicubic or linear interpolation, which damages the original local structure of the video to some extent, and this uncertainty is amplified by the randomness of the generative model. Therefore, the up-sampler used in this embodiment is implemented with a temporal mutual attention mechanism and a pixel rearrangement mechanism (the Upscaler in fig. 1), so that a more accurate up-sampled video is obtained as input to the subsequent image generation model.
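A minimal sketch of such an upsampler is given below, assuming per-pixel temporal mutual attention across frames followed by pixel rearrangement (PixelShuffle); the channel width, head count and the 4x scale are illustrative assumptions, not parameters stated in this application.

```python
import torch.nn as nn

class TemporalPixelShuffleUpsampler(nn.Module):
    """Sketch: temporal mutual attention across frames, then pixel rearrangement."""
    def __init__(self, channels=64, scale=4, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, channels, 3, padding=1)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.expand = nn.Conv2d(channels, 3 * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, video):                    # video: (T, 3, H, W)
        feat = self.embed(video)                 # (T, C, H, W)
        t, c, h, w = feat.shape
        tokens = feat.permute(2, 3, 0, 1).reshape(h * w, t, c)   # each pixel attends across frames
        aligned, _ = self.temporal_attn(tokens, tokens, tokens)
        feat = aligned.reshape(h, w, t, c).permute(2, 3, 0, 1)
        return self.shuffle(self.expand(feat))   # (T, 3, H*scale, W*scale)
```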
3. Space adaptation module (Spatial Feature Adaption, SFA)
And the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation.
Specifically, the spatial adaptation module predicts an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, where i is the frame identification of the input video, and performs an affine transformation on the feature map based on S_i and M_i.
For example, the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i,
where f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
The spatial adaptation module guides the feature transformation of the diffusion model using features extracted from the input video. Fig. 3 shows the specific architecture of the spatial adaptation module: a two-dimensional convolution layer predicts, from each frame feature map g_i of the up-sampled video, an amplification coefficient S_i and a bias coefficient M_i; the normalized generated feature map is then affine-transformed with these two coefficients, formally defined as f̂_i = S_i · (f_i − μ_i) / σ_i + M_i.
The affine-transformed feature map thus integrates the information of the original video, realizing visual content control in the spatial domain; moreover, because the affine transformation coefficients are predicted for every pixel position, precise pixel-level regulation is facilitated.
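The affine modulation above can be sketched as follows, assuming that two 3x3 convolutions predict S_i and M_i from the guidance feature g_i and that the normalization uses the per-channel spatial mean and standard deviation of f_i; the layer sizes are assumptions for illustration.

```python
import torch.nn as nn

class SpatialFeatureAdaption(nn.Module):
    """Sketch of the per-pixel affine modulation f̂_i = S_i · (f_i − μ_i) / σ_i + M_i."""
    def __init__(self, guide_channels, feat_channels):
        super().__init__()
        self.to_scale = nn.Conv2d(guide_channels, feat_channels, 3, padding=1)
        self.to_bias = nn.Conv2d(guide_channels, feat_channels, 3, padding=1)

    def forward(self, f_i, g_i, eps=1e-6):
        s_i = self.to_scale(g_i)                        # amplification coefficient S_i
        m_i = self.to_bias(g_i)                         # bias coefficient M_i
        mu = f_i.mean(dim=(-2, -1), keepdim=True)       # per-channel mean of f_i
        sigma = f_i.std(dim=(-2, -1), keepdim=True)     # per-channel std of f_i
        return s_i * (f_i - mu) / (sigma + eps) + m_i   # affine transform of the normalized map
```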
4. Time domain alignment module (Temporal Feature Alignment, TFA)
And the time domain alignment module is used for ensuring the continuity between frames, and the structure of the time domain alignment module is shown in fig. 4.
And the time domain alignment module is used for dividing the video characteristic into three-dimensional sliding windows crossing multiple frames, and processing the video characteristic in each sliding window based on a self-attention mechanism.
When the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V,
where Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
In order to ensure continuity between frames, a temporal alignment module is connected after each spatial adaptation module. Fig. 4 shows the specific structure of the temporal alignment module. Specifically, the temporal alignment module divides the generated video features into three-dimensional sliding windows spanning multiple frames and then applies a self-attention mechanism within each sliding window, softmax(Q·K^T / √d) · V, where Q, K and V are the query (Query), key (Key) and value (Value) extracted from the generated video features within the sliding window, and d is the characteristic channel dimension of the key.
In a particular implementation, a cross-attention operation may also be cascaded after the self-attention operation to process the generated video features together with the original video features, e.g. softmax(Q·K^T / √d) · V, where Q is the query extracted from the generated video features within the sliding window, and K and V are the key and value extracted from the feature G_tub of the corresponding sliding window of the original video. The temporal alignment module uses the self-attention operation to let information be interactively fused across the frames of the generated video features, and uses the cross-attention operation to further correct the generated video features with the original video features, thereby aligning the generated video features in the temporal domain.
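The window-level attention described above can be sketched as follows, with the linear projections for the queries, keys and values omitted for brevity; the tensor layout (sliding windows flattened into a batch of token sequences) is an assumption made for illustration.

```python
import torch

def temporal_feature_alignment(gen_windows, src_windows):
    """Self-attention inside each 3D sliding window of the generated features, then
    cross-attention against the matching window G_tub of the original-video features.

    gen_windows, src_windows: (num_windows, tokens_per_window, d)
    """
    d = gen_windows.shape[-1]
    # self-attention: fuse information across the frames covered by each window
    attn = torch.softmax(gen_windows @ gen_windows.transpose(-2, -1) / d ** 0.5, dim=-1)
    fused = attn @ gen_windows
    # cross-attention: correct the generated features using the original-video window features
    attn = torch.softmax(fused @ src_windows.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ src_windows
```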
5. Regulator
The adjuster is used to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d,
where w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing (namely, the high-frequency resolution video finally output by the video super-resolution model), and φ is a mapping function composed of two-dimensional convolutions.
Because the images generated by the image generation model may exhibit color deviation, the adjuster corrects the color of the images output by the image generation model, i.e., X_H = w · φ(X_u) + (1 − w) · X_d, where w is a trade-off parameter. Through feature learning, the adjuster balances the original input video content and the synthesized video content, thereby achieving better results in image quality and color fidelity.
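A minimal sketch of the adjuster, assuming the blend X_H = w · φ(X_u) + (1 − w) · X_d given above; the two-layer convolutional mapping φ and the default value of w are illustrative assumptions.

```python
import torch.nn as nn

class Adjuster(nn.Module):
    """Sketch of color-deviation correction: blend a convolutional mapping of the upsampled
    video X_u with the decoded video X_d under a trade-off weight w."""
    def __init__(self, w=0.5, channels=3):
        super().__init__()
        self.w = w
        self.mapping = nn.Sequential(                 # the mapping function φ (2D convolutions)
            nn.Conv2d(channels, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x_u, x_d):                      # x_u: upsampled video, x_d: decoded video
        return self.w * self.mapping(x_u) + (1.0 - self.w) * x_d
```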
The video super-resolution method provided by this embodiment is realized by adapting an image diffusion model: a generation model pre-trained on large-scale high-definition image data is used to generate high-resolution video from low-resolution video, and the output content of the generation model is precisely controlled, so that the generated high-resolution video has rich, faithful details and smooth consistency.
The video super-resolution method provided by this embodiment can take a low-resolution video as input and generate the corresponding high-quality, high-frequency resolution video. In addition, the spatial adaptation module and the time domain alignment module adopted in this embodiment effectively ensure that the image content within each generated high-frequency resolution video frame is rich and faithful in detail, and that the image content is consistent between frames.
This embodiment provides a video super-resolution method that acquires a video and Gaussian noise; inputs the video and the Gaussian noise into a video super-resolution model; and obtains the high-frequency resolution video output by the video super-resolution model. The video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module, a time domain alignment module and a regulator; the time domain alignment module ensures that the inter-frame details of the high-frequency resolution video are consistent. With the method provided by this embodiment, a high-frequency resolution video with continuous inter-frame details is generated by the pre-trained video super-resolution model, and the video has rich, faithful details and smooth continuity.
Based on the same inventive concept as the video super-resolution method, this embodiment provides a video super-resolution device; referring to fig. 5, the device includes:
an acquisition module 501 is configured to acquire video and gaussian noise.
The processing module 502 is configured to input the video and gaussian noise acquired by the acquiring module 501 into the video super-resolution model, and acquire a high-frequency resolution video output by the video super-resolution model.
The image generation model is composed of a variational self-encoder and a denoising network.
The denoising network is used to denoise the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video.
The variational self-encoder includes a variational encoder and a variational decoder.
The variational encoder is used to compress image data into hidden variable encoded data of the latent space.
The variational decoder is used to recover the hidden variable encoded data into image data.
A space adaptation module and a time domain alignment module are inserted after each cascade module of the denoising network and the variational decoder.
And the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation.
And the time domain alignment module is used for ensuring the continuity between frames.
The space adaptation module predicts an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, where i is the frame identification of the input video, and performs an affine transformation on the feature map based on S_i and M_i.
The affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i,
where f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
The time domain alignment module is used for dividing the video features into three-dimensional sliding windows crossing multiple frames, and processing the video features in each sliding window based on a self-attention mechanism.
When the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V,
where Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
The regulator is used to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d,
where w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
With the device provided by this embodiment, a high-frequency resolution video with continuous inter-frame details is generated through the pre-trained generation model, and the video has rich, faithful details and smooth continuity.
Based on the same inventive concept of the video super-resolution method, this embodiment provides an electronic device, as shown in fig. 6, including: memory 601, processor 602, and computer programs.
Wherein a computer program is stored in the memory 601 and configured to be executed by the processor 602 to implement the video super resolution method described above.
In particular:
the video super-resolution model comprises: a trained image generation model, an upsampler, a spatial adaptation module, and a temporal alignment module.
And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Video and gaussian noise are acquired.
And inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model.
The video super-resolution model comprises: a trained image generation model, an upsampler, a spatial adaptation module, a temporal alignment module, and a regulator.
And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Optionally, the image generation model is composed of a variational self-encoder and a denoising network.
The denoising network is used to denoise the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video.
The variational self-encoder includes a variational encoder and a variational decoder.
The variational encoder is used to compress image data into hidden variable encoded data of the latent space.
The variational decoder is used to recover the hidden variable encoded data into image data.
A space adaptation module and a time domain alignment module are inserted after each cascade module of the denoising network and the variational decoder.
And the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation.
And the time domain alignment module is used for ensuring the continuity between frames.
Optionally, the spatial adaptation module predicts an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, where i is the frame identification of the input video, and performs an affine transformation on the feature map based on S_i and M_i.
Optionally, the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i,
where f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
Optionally, the time domain alignment module is configured to divide the video feature into three-dimensional sliding windows spanning multiple frames, and process the video feature in each sliding window based on a self-attention mechanism.
Optionally, when the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V,
where Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
Optionally, the adjuster is configured to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d,
where w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
With the electronic device provided in this embodiment, the computer program executed by the processor generates, by means of a pre-trained generation model, a high-frequency resolution video whose inter-frame details are consistent; the video has rich, faithful details and smooth consistency.
Based on the same inventive concept of the video super-resolution method, the present embodiment provides a computer-readable storage medium, and a computer program stored thereon. The computer program is executed by the processor to implement the video super-resolution method described above.
In particular:
the video super-resolution model comprises: a trained image generation model, an upsampler, a spatial adaptation module, and a temporal alignment module.
And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Video and gaussian noise are acquired.
And inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model.
The video super-resolution model comprises: a trained image generation model, an upsampler, a spatial adaptation module, a temporal alignment module, and a regulator.
And the time domain alignment module is used for ensuring that the inter-frame details of the high-frequency resolution video have consistency.
Optionally, the image generation model is composed of a variational self-encoder and a denoising network.
The denoising network is used to denoise the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video.
The variational self-encoder includes a variational encoder and a variational decoder.
The variational encoder is used to compress image data into hidden variable encoded data of the latent space.
The variational decoder is used to recover the hidden variable encoded data into image data.
A space adaptation module and a time domain alignment module are inserted after each cascade module of the denoising network and the variational decoder.
And the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation.
And the time domain alignment module is used for ensuring the continuity between frames.
Optionally, the spatial adaptation module predicts an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, where i is the frame identification of the input video, and performs an affine transformation on the feature map based on S_i and M_i.
Optionally, the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i,
where f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
Optionally, the time domain alignment module is configured to divide the video feature into three-dimensional sliding windows spanning multiple frames, and process the video feature in each sliding window based on a self-attention mechanism.
Optionally, when the time domain alignment module performs the self-attention processing, it is implemented by the formula F̂ = softmax(Q·K^T / √d) · V,
where Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
Optionally, the adjuster is configured to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d,
where w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
For the computer-readable storage medium provided in this embodiment, the computer program stored thereon is executed by a processor to generate, through a pre-trained generation model, a high-frequency resolution video with continuous inter-frame details; the video has rich, faithful details and smooth continuity.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the embodiments of the present application may be implemented in various computer languages, for example, the object-oriented programming language Java and the interpreted scripting language JavaScript, etc.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A video super-resolution method, the method comprising:
acquiring video and Gaussian noise;
inputting the video and Gaussian noise into a video super-resolution model, and obtaining a high-frequency resolution video output by the video super-resolution model;
the video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module, a time domain alignment module and a regulator;
the time domain alignment module is used for ensuring continuity of inter-frame details of the high-frequency resolution video.
2. The method of claim 1, wherein the image generation model is comprised of a variational self-encoder and a denoising network;
the denoising network is used for denoising the hidden variable code of the video to which Gaussian noise has been added, so as to obtain the hidden variable code of the high-resolution video;
the variational self-encoder comprises a variational encoder and a variational decoder;
the variational encoder is used for compressing the image data into hidden variable coded data of a latent space;
the variational decoder is used for recovering the hidden variable coded data into image data;
each cascade module of the denoising network and the variational decoder is inserted into the space adaptation module and the time domain alignment module;
the space adaptation module is used for extracting the characteristics of the video so as to perform characteristic transformation;
the time domain alignment module is used for guaranteeing the continuity between frames.
3. The method according to claim 2, wherein the spatial adaptation module is configured to predict an amplification coefficient S_i and a bias coefficient M_i from each frame feature map g_i of the input video, wherein i is the frame identification of the input video, and to perform an affine transformation on the feature map based on the S_i and M_i.
4. A method according to claim 3, characterized in that the affine transformation is performed by the formula f̂_i = S_i · (f_i − μ_i) / σ_i + M_i;
wherein f_i is the feature map of the i-th frame of the input video, μ_i is the mean of f_i, and σ_i is the standard deviation of f_i.
5. The method of claim 2, wherein the temporal alignment module is configured to divide the video features into three-dimensional sliding windows across multiple frames, and wherein processing is performed within each sliding window based on a self-attention mechanism.
6. The method of claim 5, wherein the self-attention processing performed by the time domain alignment module is implemented using the formula F̂ = softmax(Q·K^T / √d) · V;
wherein Q is the query extracted in the sliding window, K is the key extracted in the sliding window, V is the value extracted in the sliding window, d is the characteristic channel dimension of the key, and F̂ is the feature processed by the self-attention mechanism.
7. The method of claim 1, wherein the adjuster is configured to perform color-deviation processing on the video by the formula X_H = w · φ(X_u) + (1 − w) · X_d;
wherein w is a trade-off parameter, X_u is the video obtained by the up-sampler through up-sampling, X_d is the decoded video produced by the image generation model, X_H is the high-frequency resolution video after color-deviation processing, and φ is a mapping function composed of two-dimensional convolutions.
8. A video super-resolution apparatus, the apparatus comprising:
the acquisition module is used for acquiring video and Gaussian noise;
the processing module is used for inputting the video and Gaussian noise acquired by the acquisition module into a video super-resolution model to acquire a high-frequency resolution video output by the video super-resolution model;
the video super-resolution model comprises: a trained image generation model, an up-sampler, a space adaptation module and a time domain alignment module;
the time domain alignment module is used for ensuring continuity of inter-frame details of the high-frequency resolution video.
9. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program being executed by a processor to implement the method of any of claims 1-7.
CN202311818992.4A 2023-12-27 2023-12-27 Video super-resolution method, device, equipment and storage medium Pending CN117830099A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311818992.4A CN117830099A (en) 2023-12-27 2023-12-27 Video super-resolution method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311818992.4A CN117830099A (en) 2023-12-27 2023-12-27 Video super-resolution method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117830099A true CN117830099A (en) 2024-04-05

Family

ID=90516683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311818992.4A Pending CN117830099A (en) 2023-12-27 2023-12-27 Video super-resolution method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117830099A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469884A (en) * 2021-07-15 2021-10-01 长视科技股份有限公司 Video super-resolution method, system, equipment and storage medium based on data simulation
CN115496663A (en) * 2022-10-12 2022-12-20 南京信息工程大学 Video super-resolution reconstruction method based on D3D convolution intra-group fusion network


Similar Documents

Publication Publication Date Title
CN110969577B (en) Video super-resolution reconstruction method based on deep double attention network
CN111587447B (en) Frame-cycled video super-resolution
US20220261965A1 (en) Training method of image processing model, image processing method, apparatus, and device
CN111784570A (en) Video image super-resolution reconstruction method and device
CN105408935B (en) Up-sampling and signal enhancing
CN107231566A (en) A kind of video transcoding method, device and system
CN112529776A (en) Training method of image processing model, image processing method and device
CN116681584A (en) Multistage diffusion image super-resolution algorithm
CN115409716B (en) Video processing method, device, storage medium and equipment
López-Tapia et al. A single video super-resolution GAN for multiple downsampling operators based on pseudo-inverse image formation models
CN108475414B (en) Image processing method and device
Liu et al. Learning noise-decoupled affine models for extreme low-light image enhancement
Agrawal et al. Image resolution enhancement using lifting wavelet and stationary wavelet transform
CN106981046B (en) Single image super resolution ratio reconstruction method based on multi-gradient constrained regression
CN114494022A (en) Model training method, super-resolution reconstruction method, device, equipment and medium
CN113747242A (en) Image processing method, image processing device, electronic equipment and storage medium
Liu et al. Arbitrary-scale super-resolution via deep learning: A comprehensive survey
Alvarez-Ramos et al. Image super-resolution via two coupled dictionaries and sparse representation
Hong et al. Image interpolation using interpolative classified vector quantization
CN116957964A (en) Small sample image generation method and system based on diffusion model
JP5514132B2 (en) Image reduction device, image enlargement device, and program thereof
CN117830099A (en) Video super-resolution method, device, equipment and storage medium
Peng Super-resolution reconstruction using multiconnection deep residual network combined an improved loss function for single-frame image
CN106447610B (en) Image rebuilding method and device
Hui et al. Rate-adaptive neural network for image compressive sensing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination