CN110647934A - Training method and device for video super-resolution reconstruction model and electronic equipment


Info

Publication number
CN110647934A
CN110647934A
Authority
CN
China
Prior art keywords
video
resolution
branch
super
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910896838.6A
Other languages
Chinese (zh)
Other versions
CN110647934B (en)
Inventor
李超 (Li Chao)
刘霄 (Liu Xiao)
孙昊 (Sun Hao)
文石磊 (Wen Shilei)
丁二锐 (Ding Errui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910896838.6A
Publication of CN110647934A
Application granted
Publication of CN110647934B
Legal status: Active
Anticipated expiration

Classifications

    • G06F18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06T17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T2207/10016 Image acquisition modality: video; image sequence
    • G06T2207/20084 Artificial neural networks [ANN]


Abstract

The application discloses a training method and device for a video super-resolution reconstruction model and an electronic device, and relates to the technical field of computer vision. The specific implementation scheme is as follows: acquiring a plurality of sample data, wherein each sample data comprises a video sample with a first resolution and a video sample with a second resolution, and the first resolution is smaller than the second resolution; establishing a video super-resolution reconstruction model, wherein the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; and training the video super-resolution reconstruction model with the plurality of sample data. The method helps retain time-domain information in the video, handle motion blur in the video, and greatly improve the visual effect of super-resolution.

Description

Training method and device for video super-resolution reconstruction model and electronic equipment
Technical Field
The application relates to the technical field of computers, in particular to the field of computer vision.
Background
Existing super-resolution reconstruction techniques focus primarily on images. Although image super-resolution technology can be applied directly to each frame of a video, frame-by-frame super-resolution reconstruction of a video yields mediocre results: it cannot resolve the motion blur in the video, and the user's visual experience is poor.
Disclosure of Invention
The embodiment of the application provides a training method and device for a video super-resolution reconstruction model and electronic equipment, so as to solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present application provides a method for training a video super-resolution reconstruction model, including:
acquiring a plurality of sample data, wherein each sample data comprises a video sample with a first resolution and a video sample with a second resolution, and the first resolution is smaller than the second resolution;
establishing a video super-resolution reconstruction model; the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is configured to extract features corresponding to the scale of its branch from the input video; the feature fusion module is configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch; and the feature fusion total module is configured to generate a super-resolution video from the output features of the plurality of neural network branches;
and training the video super-resolution reconstruction model by adopting a plurality of sample data.
This embodiment trains a neural network that combines multi-scale feature fusion, performs super-resolution processing on an input video, and outputs the corresponding super-resolved video segment, improving the visual effect of super-resolution.
In one embodiment, the feature fusion module of each branch comprises at least two feature fusion modules;
the feature fusion module being configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch, comprises:
the first feature fusion module is configured to generate features corresponding to the scale of its branch from the features output by the feature extraction module of its branch and the features output by the feature extraction module of the adjacent lower-scale branch;
each non-first feature fusion module is configured to generate features corresponding to the scale of its branch from the features output by the previous feature fusion module of the current branch and the features output by the previous feature fusion module of the adjacent lower-scale branch;
and the features output by the last feature fusion module are taken as the output features of the branch.
This embodiment provides multiple layers of feature fusion modules, so that features are fused multiple times, improving the visual effect of super-resolution.
In one embodiment, for a neural network branch whose scale is smaller than that of the input video, the feature extraction module being configured to extract features corresponding to the scale of its branch from the input video comprises: the feature extraction module downsamples the input video to obtain a downsampled video corresponding to the scale of its branch, and performs feature extraction on the downsampled video to obtain features corresponding to the scale of its branch.
In the above embodiment, a neural network branch may downsample the input video in order to perform low-scale feature extraction.
In one embodiment, training a video super-resolution reconstruction model using a plurality of sample data includes:
adopting the video sample with the first resolution as the input of the video super-resolution reconstruction model;
and training the video super-resolution reconstruction model by using the video sample with the second resolution as a supervision signal of each neural network branch and as a supervision signal of the feature fusion total module.
In the above embodiment, not only the output of the model is supervised, but also each neural network branch is supervised, so that the training effect of the model is greatly improved, and the effect of video super-resolution reconstruction is improved.
In one embodiment, using the video sample with the second resolution as a supervision signal for each neural network branch comprises: for each neural network branch, generating a branch video to be supervised at the second resolution from the output features of the neural network branch, and supervising that branch video with the video sample at the second resolution.
In a second aspect, an embodiment of the present application provides a method for reconstructing a video super-resolution, including:
receiving an original video to be super-resolved;
inputting an original video into a video super-resolution reconstruction model;
acquiring a video output by the video super-resolution reconstruction model as a super-resolution video;
the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is configured to extract features corresponding to the scale of its branch from the input video; the feature fusion module is configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch; and the feature fusion total module is configured to generate the super-resolution video from the output features of the plurality of neural network branches.
This embodiment performs super-resolution reconstruction on a video with a multi-scale feature fusion video super-resolution reconstruction model, which can handle motion blur in the video and improve the visual effect of super-resolution.
In a third aspect, an embodiment of the present application provides a training apparatus for a video super-resolution reconstruction model, including:
the system comprises a sample data acquisition unit, a processing unit and a processing unit, wherein the sample data acquisition unit is used for acquiring a plurality of sample data, each sample data comprises a video sample with a first resolution and a video sample with a second resolution, and the first resolution is smaller than the second resolution;
the model establishing unit is used for establishing a video super-resolution reconstruction model; the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is used for extracting features corresponding to the scale of the branch according to the input video; the characteristic fusion module is arranged for generating characteristics corresponding to the scale of the branch according to the characteristics of the branch and the characteristics of the adjacent low-scale branches, and taking the generated characteristics as the output characteristics of the branch; the feature fusion total module is used for generating a super-resolution video according to the output features of the plurality of neural network branches;
and the training unit is used for training the video super-resolution reconstruction model by adopting a plurality of sample data.
In one embodiment, a training unit, comprising:
an input subunit configured to adopt the video sample with the first resolution as the input of the video super-resolution reconstruction model;
and a supervision subunit configured to train the video super-resolution reconstruction model by using the video sample with the second resolution as a supervision signal of each neural network branch and as a supervision signal of the feature fusion total module.
In a fourth aspect, an embodiment of the present application provides an apparatus for reconstructing video super-resolution, including:
an original video receiving unit configured to receive an original video to be super-resolved;
an original video input unit configured to input the original video into a video super-resolution reconstruction model;
a super-resolution video acquisition unit configured to acquire the video output by the video super-resolution reconstruction model as a super-resolution video;
the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is configured to extract features corresponding to the scale of its branch from the input video; the feature fusion module is configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch; and the feature fusion total module is configured to generate the super-resolution video from the output features of the plurality of neural network branches.
In a fifth aspect, an embodiment of the present application provides an electronic device, where functions of the electronic device may be implemented by hardware, or may be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the electronic device includes a processor and a memory, the memory is used for storing a program for supporting the electronic device to execute the above-mentioned training method of the video super-resolution reconstruction model or the reconstruction method of the video super-resolution, and the processor is configured to execute the program stored in the memory. The electronic device may also include a communication interface for communicating with other devices or a communication network.
In a sixth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer software instructions for an electronic device, including a program for executing the above training method for a video super-resolution reconstruction model or the above video super-resolution reconstruction method.
One embodiment in the above application has the following advantage or benefit: the visual effect of super-resolution is improved. According to the technical solutions of the embodiments of the application, a neural network combining multi-scale feature fusion processes video segments, so that time-domain information in the video is retained, motion blur in the video is handled, and the visual effect of super-resolution is greatly improved.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart of a training method of a video super-resolution reconstruction model according to a first embodiment of the present application;
FIG. 2 is a diagram of an example of a video super-resolution reconstruction model of a training method of the video super-resolution reconstruction model according to a first embodiment of the present application;
FIG. 3 is a diagram illustrating an exemplary structure of a video super-resolution reconstruction model during training according to a method for training a video super-resolution reconstruction model according to a first embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for reconstructing super-resolution video according to a second embodiment of the present application;
FIG. 5 is a block diagram of a training apparatus for a video super-resolution reconstruction model according to a third embodiment of the present application;
FIG. 6 is a block diagram of the training unit 53 of a training apparatus for a video super-resolution reconstruction model according to a third embodiment of the present application;
FIG. 7 is a block diagram of a reconstruction apparatus for video super-resolution according to a fourth embodiment of the present application;
FIG. 8 is a block diagram of an electronic device for implementing the training method of a video super-resolution reconstruction model or the reconstruction method of video super-resolution according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Fig. 1 shows a flowchart of a training method for a video super-resolution reconstruction model according to a first embodiment of the present application, which includes:
s11, obtaining a plurality of sample data, wherein each sample data comprises a video sample with a first resolution and a video sample with a second resolution, and the first resolution is smaller than the second resolution;
s12, establishing a video super-resolution reconstruction model; the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is used for extracting features corresponding to the scale of the branch according to the input video; the characteristic fusion module is arranged for generating characteristics corresponding to the scale of the branch according to the characteristics of the branch and the characteristics of the adjacent low-scale branches, and taking the generated characteristics as the output characteristics of the branch; the feature fusion total module is used for generating a super-resolution video according to the output features of the plurality of neural network branches;
and S13, training the video super-resolution reconstruction model by adopting a plurality of sample data.
It should be noted that, for the neural network branch with the lowest scale, since there is no adjacent lower-scale branch, its feature fusion module receives only the features of its own branch.
The resolution of a video is a parameter that measures the amount of data in each image, usually expressed in PPI (pixels per inch). Generally, describing a video as 320x180 refers to its effective pixels in the horizontal and vertical directions, whereas resolution in the strict sense refers to the effective pixel value (PPI) per unit length.
Super-resolution reconstruction is the process of improving the resolution of an original image by hardware or software methods, obtaining a high-resolution image from a series of low-resolution images. It may be referred to as super-resolution for short.
The features above are scale-dependent: for the same object, the features observed at different observation scales may differ. The present application generates the final super-resolution video by fusing the features extracted at different scales, reducing the loss of information. With the amount of computation unchanged, the visual effect of super-resolution is improved.
In one embodiment, the features output by the feature extraction module and the features output by the feature fusion module may include temporal information and spatial information of the pixels.
In one embodiment, the features output by the feature extraction module and the features output by the feature fusion module may be output in the form of tensors. A tensor is a multilinear map that can represent linear relationships among vectors, scalars, and other tensors.
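As a minimal illustrative sketch (not part of the patent disclosure), a video clip and its extracted features can be held in 5-D tensors; the shapes, channel count, and the single 3-D convolution below are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn

# A video clip as a 5-D tensor: (batch, channels, time, height, width).
# Keeping the time axis lets the network carry the temporal information
# of the pixels alongside their spatial information.
clip = torch.randn(1, 3, 7, 180, 320)  # 7 RGB frames of a 320x180 video

# A hypothetical spatio-temporal feature extractor.
extract = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
features = extract(clip)  # (1, 64, 7, 180, 320): temporal + spatial features
```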
In one embodiment, the feature fusion module of each branch comprises at least two feature fusion modules;
the feature fusion module being configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch, comprises:
the first feature fusion module is configured to generate features corresponding to the scale of its branch from the features output by the feature extraction module of its branch and the features output by the feature extraction module of the adjacent lower-scale branch;
each non-first feature fusion module is configured to generate features corresponding to the scale of its branch from the features output by the previous feature fusion module of the current branch and the features output by the previous feature fusion module of the adjacent lower-scale branch;
and the features output by the last feature fusion module are taken as the output features of the branch.
The ordering of the feature fusion modules of each neural network branch (first, non-first, last) follows the direction in which the feature data flows: the first feature fusion module is connected to the feature extraction module, and the output of the last feature fusion module serves as the output of the branch in which it is located.
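To make this ordering concrete, the following sketch shows one possible fusion block; the class name, channel count, and the concatenate-then-convolve design are assumptions, not the patent's prescribed implementation. The first instance in a branch would consume the extraction outputs, each later instance the previous fusion outputs, and the last instance's output is the branch output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Hypothetical fusion block: merges this branch's features with the
    adjacent lower-scale branch's features, upsampled to this scale."""
    def __init__(self, channels=64):
        super().__init__()
        self.merge = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)
        self.solo = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, own, lower=None):
        if lower is None:  # lowest-scale branch: no adjacent lower-scale branch
            return F.relu(self.solo(own))
        # Bring the lower-scale features up to this branch's (T, H, W).
        lower = F.interpolate(lower, size=own.shape[-3:], mode='trilinear',
                              align_corners=False)
        return F.relu(self.merge(torch.cat([own, lower], dim=1)))
```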
In one embodiment, for a neural network branch whose scale is smaller than that of the input video, the feature extraction module being configured to extract features corresponding to the scale of its branch from the input video comprises: the feature extraction module downsamples the input video to obtain a downsampled video corresponding to the scale of its branch, and performs feature extraction on the downsampled video to obtain features corresponding to the scale of its branch.
Correspondingly, for a neural network branch whose scale is smaller than that of the input video, the feature extraction module may be designed as a video downsampling sub-module plus a feature extraction sub-module. That is, the video downsampling sub-module is configured to downsample the input video to obtain a downsampled video corresponding to the scale of its branch, and the feature extraction sub-module is configured to extract features from the downsampled video to obtain features corresponding to the scale of its branch.
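A sketch of that two-sub-module design under the same assumptions (the module name and the interpolation-based downsampling are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class DownscaleExtract(nn.Module):
    """Hypothetical feature extraction module for a branch whose scale is
    smaller than that of the input video: a downsampling sub-module
    followed by a feature extraction sub-module."""
    def __init__(self, scale, channels=64):
        super().__init__()
        self.scale = scale  # e.g. 0.5 for a half-scale branch
        self.extract = nn.Conv3d(3, channels, kernel_size=3, padding=1)

    def forward(self, video):  # video: (N, 3, T, H, W)
        t, h, w = video.shape[-3:]
        # Downsampling sub-module: shrink height and width, keep every frame.
        small = F.interpolate(video,
                              size=(t, int(h * self.scale), int(w * self.scale)),
                              mode='trilinear', align_corners=False)
        # Feature extraction sub-module at the branch's scale.
        return self.extract(small)
```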
In one embodiment, the scale of one of the neural network branches is equal to the scale of the input video. On the one hand, the multi-scale feature fusion then includes features extracted at the scale of the input video, and the resulting super-resolution video has a better visual effect; on the other hand, in that branch the input video does not need to be downsampled, which reduces the amount of computation and improves efficiency.
In one embodiment, the feature fusion total module being configured to generate a super-resolution video from the output features of the plurality of neural network branches comprises: the feature fusion total module generates features at a preset scale from the multi-scale output features of the plurality of neural network branches, and generates the super-resolution video from the preset-scale features. The preset scale is larger than the scales corresponding to the plurality of neural network branches. The height and width of the super-resolution video are enlarged by a preset factor relative to the height and width of the video input to the video super-resolution reconstruction model. Here, height denotes the number of pixels in the vertical direction and width the number of pixels in the horizontal direction.
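One way the feature fusion total module could merge the branch outputs into preset-scale features and render the super-resolution video is sketched below; the coarse-to-fine merging order and the per-frame pixel shuffle are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Hypothetical feature fusion total module."""
    def __init__(self, channels=64, up=4):  # up: preset height/width factor
        super().__init__()
        self.merge = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)
        self.to_rgb = nn.Conv3d(channels, 3 * up * up, kernel_size=3, padding=1)
        self.up = up

    def forward(self, feats):  # branch output features, ordered coarse -> fine
        x = feats[0]
        for f in feats[1:]:
            x = F.interpolate(x, size=f.shape[-3:], mode='trilinear',
                              align_corners=False)
            x = F.relu(self.merge(torch.cat([x, f], dim=1)))
        x = self.to_rgb(x)  # (N, 3*up*up, T, H, W)
        n, _, t, h, w = x.shape
        # Per-frame pixel shuffle: fold channels into an (up*H) x (up*W) grid.
        x = x.permute(0, 2, 1, 3, 4).reshape(n * t, 3 * self.up ** 2, h, w)
        x = F.pixel_shuffle(x, self.up)  # (N*T, 3, up*H, up*W)
        return x.reshape(n, t, 3, h * self.up, w * self.up).permute(0, 2, 1, 3, 4)
```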
If the video sample with the second resolution is used as a supervision signal, the height and width of the super-resolution video are enlarged by the preset factor relative to those of the video input to the video super-resolution reconstruction model, so that they are consistent with the height and width of the video sample with the second resolution.
In one embodiment, step S13 comprises: adopting the video sample with the first resolution as the input of the video super-resolution reconstruction model; and training the video super-resolution reconstruction model by using the video sample with the second resolution as a supervision signal of each neural network branch and as a supervision signal of the feature fusion total module. The video sample with the second resolution can also be understood as the expected output of the video super-resolution reconstruction model.
In one embodiment, using the video sample with the second resolution as a supervision signal for each neural network branch comprises: for each neural network branch, generating a branch video to be supervised at the second resolution from the output features of the neural network branch, and supervising that branch video with the video sample at the second resolution.
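Per-branch supervision could then be sketched as follows, continuing the assumptions above: a hypothetical rendering head turns a branch's output features into a second-resolution branch video, which is compared against the second-resolution sample (the L1 loss is an assumed choice):

```python
import torch.nn as nn
import torch.nn.functional as F

class BranchSupervisor(nn.Module):
    """Hypothetical branch supervision head."""
    def __init__(self, channels=64):
        super().__init__()
        self.render = nn.Conv3d(channels, 3, kernel_size=3, padding=1)

    def forward(self, branch_feat, target):  # target: second-resolution sample
        video = self.render(branch_feat)  # branch video at the branch's scale
        # Upsample to the second resolution so the sample can supervise it.
        video = F.interpolate(video, size=target.shape[-3:], mode='trilinear',
                              align_corners=False)
        return F.l1_loss(video, target)  # supervision signal for this branch
```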
In an example, referring to the examples of fig. 2 and fig. 3, the video super-resolution reconstruction model provided by the embodiment of the present application includes a feature fusion total module and three neural network branches.
The first neural network branch corresponds to a first scale, the second neural network branch corresponds to a second scale, the third neural network branch corresponds to a third scale, the first scale > the second scale > the third scale, and the first scale is the same as the scale of the input video.
The first neural network branch comprises a first feature extraction module A1, a feature fusion module B11 and a feature fusion module B12; the second neural network branch comprises a second feature extraction module A2, a feature fusion module B21 and a feature fusion module B22; the third neural network branch comprises a third feature extraction module A3, a feature fusion module B31 and a feature fusion module B32.
For the first neural network branch, the feature fusion module B11 receives the features output by the first feature extraction module A1 and the features output by the second feature extraction module A2, and generates features of a first scale. The feature fusion module B12 receives the features output by the feature fusion module B11 and the features output by the feature fusion module B21 and generates features of a first scale.
For the second neural network branch, the feature fusion module B21 receives the features output by the second feature extraction module A2 and the features output by the third feature extraction module A3 and generates features at a second scale. The feature fusion module B22 receives the features output by the feature fusion module B21 and the features output by the feature fusion module B31 and generates features at a second scale.
For the third neural network branch, the feature fusion module B31 receives the features output by the third feature extraction module A3 and generates features at a third scale. The feature fusion module B32 receives the features output by the feature fusion module B31 and generates features at a third scale.
It should be noted that, since the scales of the second and third neural network branches are both smaller than the scale of the input video, the second and third feature extraction modules may be designed as a video downsampling sub-module plus a feature extraction sub-module: the video downsampling sub-module downsamples the input video to obtain a downsampled video corresponding to the scale of its branch, and the feature extraction sub-module extracts features from the downsampled video to obtain features corresponding to the scale of its branch. Of course, in other embodiments, the feature extraction module may be designed without the downsampling function, and a separate video downsampling module may be added to the second and third neural network branches to implement downsampling.
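Wiring the example of FIG. 2 and FIG. 3 together, and reusing the hypothetical FusionModule, DownscaleExtract, and FusionHead classes sketched earlier (the scales 1, 1/2, 1/4 and the x4 output factor are assumptions):

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 7, 180, 320)  # input at the first (full) scale

A1 = nn.Conv3d(3, 64, kernel_size=3, padding=1)  # full-scale extraction
A2, A3 = DownscaleExtract(0.5), DownscaleExtract(0.25)
B11, B12 = FusionModule(), FusionModule()
B21, B22 = FusionModule(), FusionModule()
B31, B32 = FusionModule(), FusionModule()
head = FusionHead(up=4)

a1, a2, a3 = A1(video), A2(video), A3(video)            # per-scale features
f11, f21, f31 = B11(a1, a2), B21(a2, a3), B31(a3)       # first fusion modules
f12, f22, f32 = B12(f11, f21), B22(f21, f31), B32(f31)  # last fusion modules
sr = head([f32, f22, f12])  # super-resolution video, (1, 3, 7, 720, 1280)
```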
In one example, a plurality of branch supervision modules may be provided, each connected to the output of a neural network branch, that is, to the last feature fusion module of that branch. The branch supervision module is configured to use the video sample with the second resolution as the supervision signal of its neural network branch, implementing supervision of each branch.
In one example, a master supervision module may further be provided. The master supervision module is configured to use the video sample with the second resolution as the supervision signal of the feature fusion total module, supervising the output of the feature fusion total module, that is, the output of the video super-resolution reconstruction model.
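A minimal training-step sketch combining both kinds of supervision, continuing the assumed wiring above (the forward pass, the unweighted loss sum, and the Adam optimizer are all illustrative choices):

```python
import torch
import torch.nn.functional as F

sup1, sup2, sup3 = BranchSupervisor(), BranchSupervisor(), BranchSupervisor()
modules = [A1, A2, A3, B11, B12, B21, B22, B31, B32, head, sup1, sup2, sup3]
optimizer = torch.optim.Adam([p for m in modules for p in m.parameters()],
                             lr=1e-4)

def train_step(low_res, high_res):
    """low_res: first-resolution sample; high_res: second-resolution sample."""
    a1, a2, a3 = A1(low_res), A2(low_res), A3(low_res)
    f11, f21, f31 = B11(a1, a2), B21(a2, a3), B31(a3)
    f12, f22, f32 = B12(f11, f21), B22(f21, f31), B32(f31)
    loss = F.l1_loss(head([f32, f22, f12]), high_res)  # master supervision
    for sup, feat in zip((sup1, sup2, sup3), (f12, f22, f32)):
        loss = loss + sup(feat, high_res)              # branch supervision
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```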
The number of the neural network branches and the number of the feature fusion modules given in fig. 2 and fig. 3 are only an example, and those skilled in the art may adjust the number of the neural network branches and the number of the feature fusion modules according to requirements.
Fig. 4 shows a flow chart of a reconstruction method for super-resolution video according to a second embodiment of the present application, and referring to fig. 4, the method includes:
s41, receiving an original video to be super-resolved;
s42, inputting the original video into a video super-resolution reconstruction model;
s43, acquiring a video output by the video super-resolution reconstruction model as a super-resolution video;
the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is configured to extract features corresponding to the scale of its branch from the input video; the feature fusion module is configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch; and the feature fusion total module is configured to generate the super-resolution video from the output features of the plurality of neural network branches.
For the specific details of the video super-resolution reconstruction model of this embodiment, refer to the description of the training method of the video super-resolution reconstruction model in the first embodiment; they are not repeated here.
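A minimal inference sketch for S41-S43; `model` is an assumed nn.Module wrapper bundling the trained branches and fusion head sketched in the first embodiment:

```python
import torch

model.eval()  # `model` is assumed to exist and to have been trained
with torch.no_grad():
    original = torch.randn(1, 3, 7, 180, 320)  # stand-in for a received clip (S41)
    super_res = model(original)                # S42: run the reconstruction model
print(super_res.shape)                         # S43: e.g. (1, 3, 7, 720, 1280)
```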
Fig. 5 shows a block diagram of a training apparatus 5 for a video super-resolution reconstruction model according to an embodiment of the present application. Referring to fig. 5, it includes:
a sample data acquisition unit 51 configured to acquire a plurality of sample data, wherein each sample data comprises a video sample with a first resolution and a video sample with a second resolution, and the first resolution is smaller than the second resolution;
a model establishing unit 52 configured to establish a video super-resolution reconstruction model; the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is configured to extract features corresponding to the scale of its branch from the input video; the feature fusion module is configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch; and the feature fusion total module is configured to generate a super-resolution video from the output features of the plurality of neural network branches;
and a training unit 53 configured to train the video super-resolution reconstruction model with the plurality of sample data.
In one embodiment, referring to fig. 6, the training unit 53, comprises:
an input subunit 61 configured to adopt the video sample with the first resolution as the input of the video super-resolution reconstruction model;
and a supervision subunit 62 configured to train the video super-resolution reconstruction model by using the video sample with the second resolution as a supervision signal of each neural network branch and as a supervision signal of the feature fusion total module.
Fig. 7 shows a block diagram of a reconstruction apparatus 7 for super-resolution video according to an embodiment of the present application. Referring to fig. 7, including:
an original video receiving unit 71 configured to receive an original video to be super-resolved;
an original video input unit 72 configured to input the original video into the video super-resolution reconstruction model;
a super-resolution video acquisition unit 73 configured to acquire the video output by the video super-resolution reconstruction model as a super-resolution video;
the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is configured to extract features corresponding to the scale of its branch from the input video; the feature fusion module is configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch; and the feature fusion total module is configured to generate the super-resolution video from the output features of the plurality of neural network branches.
The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
FIG. 8 is a block diagram of an electronic device for training a video super-resolution reconstruction model or reconstructing video super-resolution according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in FIG. 8, the electronic device includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). FIG. 8 takes one processor 801 as an example.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform a method for training a video super-resolution reconstruction model or a method for reconstructing video super-resolution provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the training method of the video super-resolution reconstruction model or the reconstruction method of the video super-resolution provided herein.
The memory 802, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the training method of the video super-resolution reconstruction model or the reconstruction method of the video super-resolution in the embodiment of the present application (for example, the sample data acquisition unit 51, the model building unit 52, and the training unit 53 shown in fig. 5). The processor 801 executes various functional applications of the server and data processing, namely, a training method of a video super-resolution reconstruction model or a reconstruction method of video super-resolution in the above-described method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory 802.
The memory 802 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created from the use of the electronic device for the training method of the video super-resolution reconstruction model or the reconstruction method of video super-resolution, and the like. Further, the memory 802 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 802 optionally includes memories remotely located with respect to the processor 801, and these remote memories may be connected via a network to an electronic device implementing the training method of the video super-resolution reconstruction model or the reconstruction method of video super-resolution. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device implementing the training method of the video super-resolution reconstruction model or the reconstruction method of the video super-resolution may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the training method of the video super-resolution reconstruction model or the reconstruction method of video super-resolution; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device 804 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode Ray Tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solutions of the embodiments of the application, the input video is processed by a neural network that combines multi-scale feature fusion to obtain a super-resolved video segment, greatly improving the visual effect of super-resolution.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A training method of a video super-resolution reconstruction model is characterized by comprising the following steps:
obtaining a plurality of sample data, each sample data comprising a video sample of a first resolution and a video sample of a second resolution, the first resolution being less than the second resolution;
establishing a video super-resolution reconstruction model, wherein the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is configured to extract features corresponding to the scale of its branch from the input video; the feature fusion module is configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch; and the feature fusion total module is configured to generate a super-resolution video from the output features of the plurality of neural network branches;
and training the video super-resolution reconstruction model by adopting the plurality of sample data.
2. The method of claim 1, wherein the feature fusion module of each of the branches comprises at least two feature fusion modules;
the feature fusion module being configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch, comprises:
the first feature fusion module is configured to generate features corresponding to the scale of its branch from the features output by the feature extraction module of its branch and the features output by the feature extraction module of the adjacent lower-scale branch;
each non-first feature fusion module is configured to generate features corresponding to the scale of its branch from the features output by the previous feature fusion module of the current branch and the features output by the previous feature fusion module of the adjacent lower-scale branch;
and the features output by the last feature fusion module are taken as the output features of the branch.
3. The method of claim 1, wherein, for a neural network branch whose scale is smaller than that of the input video, the feature extraction module being configured to extract features corresponding to the scale of its branch from the input video comprises: the feature extraction module downsamples the input video to obtain a downsampled video corresponding to the scale of its branch; and performs feature extraction on the downsampled video to obtain features corresponding to the scale of its branch.
4. The method of claim 1, wherein the training the video super-resolution reconstruction model using the plurality of sample data comprises:
adopting the video sample with the first resolution as the input of the video super-resolution reconstruction model;
and training the video super-resolution reconstruction model by using the video sample with the second resolution as a supervisory signal of each neural network branch and using the video sample with the second resolution as a supervisory signal of the feature fusion total module.
5. The method of claim 4, wherein said using the video samples of the second resolution as supervisory signals for each of the neural network branches comprises: for each neural network branch, generating a branch video to be supervised at the second resolution according to the output characteristics of the neural network branch, and supervising the branch video to be supervised at the second resolution by using the video sample at the second resolution.
6. A method for reconstructing video super-resolution is characterized by comprising the following steps:
receiving an original video to be super-resolved;
inputting the original video into the video super-resolution reconstruction model;
acquiring a video output by the video super-resolution reconstruction model as a super-resolution video;
the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to a different scale, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is configured to extract features corresponding to the scale of its branch from the input video; the feature fusion module is configured to generate features corresponding to the scale of its branch from the features of its branch and the features of the adjacent lower-scale branch, and to take the generated features as the output features of the branch; and the feature fusion total module is configured to generate the super-resolution video from the output features of the plurality of neural network branches.
7. The training device for the video super-resolution reconstruction model is characterized by comprising the following components:
the system comprises a sample data acquisition unit, a processing unit and a display unit, wherein the sample data acquisition unit is used for acquiring a plurality of sample data, each sample data comprises a video sample with a first resolution and a video sample with a second resolution, and the first resolution is smaller than the second resolution;
the model establishing unit is used for establishing a video super-resolution reconstruction model; the video super-resolution reconstruction model comprises a feature fusion total module and at least two neural network branches, wherein each neural network branch corresponds to different scales, and each neural network branch comprises a feature extraction module and a feature fusion module; the feature extraction module is used for extracting features corresponding to the scale of the branch according to the input video; the characteristic fusion module is arranged for generating characteristics corresponding to the scale of the branch according to the characteristics of the branch and the characteristics of the adjacent low-scale branches, and taking the generated characteristics as the output characteristics of the branch; the feature fusion total module is used for generating a super-resolution video according to the output features of the plurality of neural network branches;
and the training unit is used for training the video super-resolution reconstruction model by adopting the plurality of sample data.
8. The apparatus of claim 7, wherein the training unit comprises:
an input subunit, configured to use the video sample with the first resolution as an input of the video super-resolution reconstruction model;
and the supervision subunit is used for utilizing the video sample with the second resolution as a supervision signal of each neural network branch and utilizing the video sample with the second resolution as a supervision signal of the feature fusion total module to train the video super-resolution reconstruction model.
9. An apparatus for video super-resolution reconstruction, comprising:
an original video receiving unit configured to receive an original video to be super-resolved;
an original video input unit configured to input the original video into a video super-resolution reconstruction model;
a super-resolution video acquisition unit configured to acquire a video output by the video super-resolution reconstruction model as a super-resolution video;
wherein the video super-resolution reconstruction model comprises an overall feature fusion module and at least two neural network branches, each neural network branch corresponding to a different scale and comprising a feature extraction module and a feature fusion module; the feature extraction module is configured to extract, from the input video, features at the scale of its branch; the feature fusion module is configured to generate features at the scale of its branch from the features of that branch and the features of the adjacent lower-scale branch, the generated features serving as the output features of the branch; and the overall feature fusion module is configured to generate the super-resolution video from the output features of the neural network branches.
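For illustration only (not part of the claims): how the three units of claim 9 might be exercised frame by frame, assuming a trained single-frame model and OpenCV for video I/O; the mp4v codec and BGR handling are assumptions of the sketch.

    import cv2
    import torch

    def super_resolve_video(model, in_path, out_path, device="cuda"):
        cap = cv2.VideoCapture(in_path)          # original video receiving unit
        fps = cap.get(cv2.CAP_PROP_FPS)
        writer = None
        model = model.eval().to(device)
        with torch.no_grad():
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                # BGR uint8 HxWx3 -> float NCHW in [0, 1]; BGR order is kept
                # on both read and write sides, so no color conversion is needed.
                x = torch.from_numpy(frame).permute(2, 0, 1).float().div(255)
                sr = model(x.unsqueeze(0).to(device))   # original video input unit
                sr = sr.clamp(0, 1).squeeze(0).permute(1, 2, 0).mul(255)
                sr = sr.byte().cpu().contiguous().numpy()
                if writer is None:   # super-resolution video acquisition unit
                    h, w = sr.shape[:2]
                    writer = cv2.VideoWriter(
                        out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
                writer.write(sr)
        cap.release()
        if writer is not None:
            writer.release()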
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
11. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910896838.6A CN110647934B (en) 2019-09-20 2019-09-20 Training method and device for video super-resolution reconstruction model and electronic equipment

Publications (2)

Publication Number Publication Date
CN110647934A 2020-01-03
CN110647934B (en) 2022-04-08

Family

ID=68992321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910896838.6A Active CN110647934B (en) 2019-09-20 2019-09-20 Training method and device for video super-resolution reconstruction model and electronic equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170148139A1 (en) * 2015-11-25 2017-05-25 Heptagon Micro Optics Pte. Ltd. Super-resolution image reconstruction using high-frequency band extraction
CN106991646A (en) * 2017-03-28 2017-07-28 福建帝视信息科技有限公司 A kind of image super-resolution method based on intensive connection network
CN109426786A (en) * 2017-08-31 2019-03-05 爱唯秀股份有限公司 Number detection system and number detection method
CN108428212A (en) * 2018-01-30 2018-08-21 中山大学 A kind of image magnification method based on double laplacian pyramid convolutional neural networks
CN109345449A (en) * 2018-07-17 2019-02-15 西安交通大学 A kind of image super-resolution based on converged network and remove non-homogeneous blur method
CN109271960A (en) * 2018-10-08 2019-01-25 燕山大学 A kind of demographic method based on convolutional neural networks
CN109255755A (en) * 2018-10-24 2019-01-22 上海大学 Image super-resolution rebuilding method based on multiple row convolutional neural networks
CN109685717A (en) * 2018-12-14 2019-04-26 厦门理工学院 Image super-resolution rebuilding method, device and electronic equipment
CN110084745A (en) * 2019-03-12 2019-08-02 天津大学 Image super-resolution rebuilding method based on dense convolutional neural networks in parallel
CN110136066A (en) * 2019-05-23 2019-08-16 北京百度网讯科技有限公司 Super-resolution method, device, equipment and storage medium towards video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHAOFENG WANG et al.: "Lightweight Image Super-Resolution with Adaptive Weighted Learning Network", arXiv *
WU LEI et al.: "Image Super-Resolution Reconstruction Based on Multi-Scale Recursive Network", Acta Optica Sinica *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369440A (en) * 2020-03-03 2020-07-03 网易(杭州)网络有限公司 Model training method, image super-resolution processing method, device, terminal and storage medium
CN111369440B (en) * 2020-03-03 2024-01-30 网易(杭州)网络有限公司 Model training and image super-resolution processing method, device, terminal and storage medium

Also Published As

Publication number Publication date
CN110647934B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN111182254B (en) Video processing method, device, equipment and storage medium
CN111654723B (en) Video quality improving method and device, electronic equipment and storage medium
EP3843031A2 (en) Face super-resolution realization method and apparatus, electronic device and storage medium
CN114549935B (en) Information generation method and device
CN111524166B (en) Video frame processing method and device
CN112584077B (en) Video frame interpolation method and device and electronic equipment
CN111327926A (en) Video frame insertion method and device, electronic equipment and storage medium
CN112541482A (en) Deep information completion model training method, device, equipment and storage medium
EP3819820B1 (en) Method and apparatus for recognizing key identifier in video, device and storage medium
CN111192215B (en) Image processing method, device, equipment and readable storage medium
CN111539438B (en) Text content identification method and device and electronic equipment
CN110648294A (en) Image restoration method and device and electronic equipment
US20220103782A1 (en) Method for video frame interpolation, and electronic device
CN113014936A (en) Video frame insertion method, device, equipment and storage medium
CN112509146A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113014937A (en) Video frame insertion method, device, equipment and storage medium
CN110647936B (en) Training method and device for video super-resolution reconstruction model and electronic equipment
CN110647934B (en) Training method and device for video super-resolution reconstruction model and electronic equipment
CN110648293B (en) Image restoration method and device and electronic equipment
CN111768467A (en) Image filling method, device, equipment and storage medium
CN111601013A (en) Method and apparatus for processing video frames
CN112508964B (en) Image segmentation method, device, electronic equipment and storage medium
CN110798681B (en) Monitoring method and device of imaging equipment and computer equipment
CN112560854A (en) Method, apparatus, device and storage medium for processing image
CN111767490A (en) Method, device, equipment and storage medium for displaying image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant