CN113724136A - Video restoration method, device and medium - Google Patents

Video restoration method, device and medium

Info

Publication number
CN113724136A
Authority
CN
China
Prior art keywords
resolution
low
resolution video
degradation
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111039056.4A
Other languages
Chinese (zh)
Inventor
曾裕斌
洪国伟
董治
雷兆恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202111039056.4A
Publication of CN113724136A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T3/4076 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution using the original low-resolution images to iteratively correct the high-resolution images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application discloses a video restoration method, device and medium, wherein the method comprises the following steps: acquiring a first training data set; extracting a first degradation representation of each first low-resolution video in the first training data set by using a trained feature extraction network; inputting the first degradation representation and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video; determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain a trained super-resolution network; when a video to be restored is obtained, extracting a second degradation representation of the video to be restored by using the trained feature extraction network, and inputting the second degradation representation and the video to be restored into the trained super-resolution network to obtain the restored high-resolution video. In this way, the video can be restored in a manner adapted to the noise introduced by different compression coding modes, and the video restoration effect is improved.

Description

Video restoration method, device and medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a video restoration method, device, and medium.
Background
Existing video quality improvement mainly uses noise reduction, deblurring and super-resolution algorithms to improve the image quality and resolution of a video file. Noise reduction and deblurring algorithms mainly perform filtering-based noise reduction and Gaussian sharpening on the picture by means such as low-pass filtering. Super-resolution methods mainly align multiple video frames by computing optical flow and extract features from the redundant information of adjacent video frames, so that blurred regions in the current frame can borrow information from adjacent frames when restoring a high-resolution picture.
The prior art ignores the various distortions that real video data may contain as a result of its compression coding mode, such as MPEG-2 and HEVC compression noise, which leads to a poor video repair effect.
Disclosure of Invention
In view of this, an object of the present application is to provide a video repair method, device and medium, which can perform corresponding repair on a video according to noise caused by different compression coding methods, so as to improve a video repair effect. The specific scheme is as follows:
in a first aspect, the present application discloses a video repair method, including:
acquiring a first training data set, wherein the first training data set comprises a plurality of first low-resolution videos and corresponding high-resolution videos;
extracting a first degradation representation of each first low-resolution video in the first training data set by using a trained feature extraction network, wherein the first degradation representation is used for representing a compression coding mode of the corresponding first low-resolution video;
inputting the first degradation representation and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video;
determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain a trained super-resolution network;
when the video to be restored is obtained, extracting a second degradation representation of the video to be restored by using the trained feature extraction network, and inputting the second degradation representation and the video to be restored into the trained super-resolution network to obtain the restored high-resolution video.
Optionally, before the extracting, by using the trained feature extraction network, the first degradation characterization of each first low-resolution video in the first training data set, the method further includes:
acquiring a second training data set, wherein the second training data set comprises a plurality of second low-resolution videos and corresponding degradation characterization labels, and the degradation characterization labels represent the manually labeled actual compression coding modes of the second low-resolution videos;
inputting each second low-resolution video into a pre-constructed feature extraction network to obtain a prediction degradation representation of each second low-resolution video;
and determining a second error based on the prediction degradation characteristics and the degradation characteristic labels, and updating the feature extraction network based on the second error until the feature extraction network converges to obtain the trained feature extraction network.
Optionally, the inputting each second low-resolution video into a pre-constructed feature extraction network to obtain a prediction degradation characterization of each second low-resolution video includes:
decoding each of the second low resolution videos in the second training data set;
storing each second low-resolution video frame in each decoded second low-resolution video into a lossless picture format to obtain different second low-resolution video frame sequences, wherein one decoded second low-resolution video corresponds to one second low-resolution video frame sequence;
respectively inputting each second low-resolution video frame in each second low-resolution video frame sequence into a pre-constructed feature extraction network to obtain a prediction degradation representation of each second low-resolution video frame so as to obtain a prediction degradation representation of each second low-resolution video;
accordingly, determining a second error based on the predicted degradation characterization and the degradation characterization tag comprises:
determining a corresponding second error based on the predicted degradation characterization and the corresponding degradation characterization tag of each of the second low resolution video frames, respectively.
Optionally, determining a corresponding second error based on the predicted degradation characterization of any second low resolution video frame and the degradation characterization tag of the second low resolution video frame comprises:
when the feature extraction network is a Triplet Loss-based feature extraction network, determining a first parameter and a second parameter according to the degradation characterization label corresponding to the second low-resolution video frame, wherein the first parameter is a degradation characterization, in the feature extraction network, of a video frame whose compression coding mode is the same as that of the second low-resolution video frame, and the second parameter is a degradation characterization, in the feature extraction network, of a video frame whose compression coding mode is different from that of the second low-resolution video frame;
and determining the triple Loss of the second low-resolution video frame based on the predictive degradation characterization of the second low-resolution video frame, the first parameter and the second parameter.
Optionally, the convolutional layers in the feature extraction network adopt depthwise separable convolution, and when data passes through a convolutional layer in the feature extraction network, point-by-point convolution is performed first, and after the point-by-point convolution is completed, channel-by-channel convolution is performed.
Optionally, the inputting the first degradation representation and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video includes:
dividing each first low resolution video frame sequence included in the first training data set into different first low resolution video frame subsequences by using a sliding window and a preset step length, wherein the number of video frames included in each first low resolution video frame subsequence is an odd number;
and sequentially inputting each first low-resolution video frame subsequence and the corresponding first degradation representation into a pre-constructed super-resolution network to obtain a predicted high-resolution video frame corresponding to each first low-resolution video frame.
Optionally, the sequentially inputting each first low-resolution video frame subsequence and the corresponding first degradation representation into a pre-constructed super-resolution network to obtain a predicted high-resolution video frame corresponding to each first low-resolution video frame includes:
inputting a currently input first low-resolution video frame subsequence into the super-resolution network;
performing implicit feature alignment on a currently input first low-resolution video frame subsequence through a feature alignment layer in the super-resolution network to obtain a corresponding aligned feature map;
splicing the aligned feature maps to obtain spliced feature maps;
inputting the spliced feature map and a target degradation representation into a super-resolution layer in the super-resolution network to obtain a first target high-resolution video frame corresponding to a target low-resolution video frame, wherein the target low-resolution video frame is the middle video frame of the currently input first low-resolution video frame subsequence, and the target degradation representation is the first degradation representation of the target low-resolution video frame;
performing bicubic interpolation on the target low-resolution video frame to obtain a second target high-resolution video frame;
and adding the first target high-resolution video frame and the second target high-resolution video frame to obtain a predicted high-resolution video frame corresponding to the target low-resolution video frame.
Optionally, the inputting the spliced feature map and the target degradation representation into a super-resolution layer in the super-resolution network to obtain a first target high-resolution video frame corresponding to a target low-resolution video frame includes:
inputting the spliced feature map and the target degradation representation into a super-resolution layer in the super-resolution network;
pooling the spliced feature map through a pooling layer in the super-resolution layer to obtain a pooling result of the spliced feature map;
multiplying the pooling result by the target degradation characterization, and inputting the multiplied result to a full-connection network in the super-resolution layer to obtain a weight factor;
multiplying the weight factor by the pooling result to obtain an adjusted pooling result;
and carrying out deconvolution on the adjusted pooling result through a deconvolution layer in the super-resolution layer to obtain the first target high-resolution video frame.
Optionally, the acquiring a first training data set includes:
acquiring the high-resolution video;
performing bicubic downsampling on the high-resolution video to obtain the first low-resolution video;
and processing the first low-resolution video by using different compression coding modes, and taking the processed first low-resolution video and the processed high-resolution video as the first training data set.
In a second aspect, the present application discloses a video repair apparatus, comprising:
the data acquisition module is used for acquiring a first training data set, wherein the first training data set comprises a plurality of first low-resolution videos and corresponding high-resolution videos;
a degradation feature extraction module, configured to extract, by using a trained feature extraction network, a first degradation representation of each first low-resolution video in the first training data set, where the first degradation representation is used to represent a compression encoding mode of a corresponding first low-resolution video;
the training module is used for inputting the first degradation characteristic and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video; determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain a trained super-resolution network;
and the video restoration module is used for extracting a second degradation representation of the video to be restored by using the trained feature extraction network when the video to be restored is obtained, and inputting the second degradation representation and the video to be restored into the trained super-resolution network to obtain a restored high-resolution video.
In a third aspect, the present application discloses an electronic device, comprising:
a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the video repair method disclosed in the foregoing.
In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the video repair method disclosed above.
As can be seen, a first training data set is first obtained in the present application, where the first training data set includes a plurality of first low-resolution videos and corresponding high-resolution videos. And then extracting a first degradation representation of each first low-resolution video in the first training data set by using the trained feature extraction network, wherein the first degradation representation is used for representing a compression coding mode of the corresponding first low-resolution video. Inputting the first degradation characteristic and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video; and determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain the trained super-resolution network. When the video to be restored is obtained, extracting a second degradation representation of the video to be restored by using the trained feature extraction network, and inputting the second degradation representation and the video to be restored into the trained super-resolution network to obtain the restored high-resolution video. Therefore, when the super-resolution network performs super-resolution processing on the low-resolution video, the degradation representation of the compression coding mode of the low-resolution video needs to be combined for processing, so that the super-resolution network can adapt to noise caused by different compression coding modes, the video is correspondingly repaired, and the video repairing effect is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a system framework to which the video repair scheme provided herein is applicable;
FIG. 2 is a flow chart of a video repair method disclosed in the present application;
FIG. 3 is a flow chart of a training data set generation disclosed herein;
FIG. 4 is a flow chart of a specific video repair disclosed herein;
FIG. 5 is a partial flow diagram of a particular video repair method disclosed herein;
FIG. 6 is a flow chart of a depth separable convolution as disclosed herein;
FIG. 7 is a partial flow diagram of a particular video repair method disclosed herein;
FIG. 8 is a schematic diagram of a specific super-resolution network disclosed in the present application;
FIG. 9 is a schematic diagram of a specific super-resolution network disclosed in the present application;
FIG. 10 is a diagram illustrating video repair effects disclosed herein;
fig. 11 is a schematic structural diagram of a video repair apparatus according to the present disclosure;
fig. 12 is a schematic structural diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, noise reduction, deblurring and super-resolution algorithms are mainly used to improve the image quality and resolution of a video file. Noise reduction and deblurring algorithms mainly perform filtering-based noise reduction and Gaussian sharpening on the picture by means such as low-pass filtering. Super-resolution methods mainly align multiple video frames by computing optical flow and extract features from the redundant information of adjacent video frames, so that blurred regions in the current frame can borrow information from adjacent frames when restoring a high-resolution picture. However, the above prior art ignores the various distortions that real video data may contain as a result of its compression coding mode, such as MPEG-2 and HEVC compression noise, so the video repair effect is poor. In view of this, the present application provides a video repair method which repairs a video according to the noise caused by different compression coding modes, so as to improve the video repair effect.
In the video repair scheme of the present application, the adopted system framework may specifically refer to fig. 1, and may specifically include: the system comprises a background server and a plurality of user terminals which are in communication connection with the background server. The user side includes, but is not limited to, a tablet computer, a notebook computer, a smart phone, and a Personal Computer (PC), and is not limited herein.
In the present application, a background server executes a video repair method, including obtaining a first training data set, where the first training data set includes a plurality of first low-resolution videos and corresponding high-resolution videos; extracting a first degradation representation of each first low-resolution video in the first training data set by using a trained feature extraction network, wherein the first degradation representation is used for representing a compression coding mode of the corresponding first low-resolution video; inputting the first degradation characteristic and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video; and determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain the trained super-resolution network. After the background server finishes the super-resolution network training to obtain the trained super-resolution network, the trained feature extraction network and the trained super-resolution network can be sent to each user terminal, when each user terminal obtains a video to be restored, the trained feature extraction network is used for extracting a second degradation representation of the video to be restored, and the second degradation representation and the video to be restored are input into the trained super-resolution network to obtain the restored high-resolution video.
Referring to fig. 2, an embodiment of the present application discloses a video repair method, including:
step S11: a first training data set is obtained, wherein the first training data set includes a plurality of first low resolution videos and corresponding high resolution videos.
In an actual implementation process, a first training data set needs to be acquired first, where the first training data set includes a plurality of first low-resolution videos and corresponding high-resolution videos.
The acquiring of the first training data set may specifically include: acquiring the high-resolution video; performing bicubic downsampling on the high-resolution video to obtain the first low-resolution video; and processing the first low-resolution video by using different compression coding modes, and taking the processed first low-resolution video and the processed high-resolution video as the first training data set.
That is, in an actual application process, a high-resolution video may be obtained first, and bicubic downsampling may be performed on it to obtain a first low-resolution video. The obtained first low-resolution video is then processed with different compression coding modes, and the processed first low-resolution videos together with the high-resolution video are used as the first training data set, where the compression coding modes include, but are not limited to, MPEG (Moving Picture Experts Group), H.264 and H.265.
The current computer may directly acquire the first training data set processed by another computer, and the other computer may acquire the first training data set obtained through the foregoing steps, or may acquire data transmitted by another computer, which is not specifically limited herein. Of course, the current computer may also directly acquire a high-resolution video, then perform bicubic downsampling on the acquired high-resolution video to obtain a first low-resolution video, then process the obtained first low-resolution video by using different compression coding methods, and use the processed first low-resolution video and the processed high-resolution video as a first training data set.
Referring to FIG. 3, a flow chart is generated for the first training data set. Firstly, a high-resolution video is obtained, then bicubic downsampling is carried out on the obtained high-resolution video, so that a first low-resolution video is obtained, video coding compression is carried out on the obtained first low-resolution video by utilizing different compression coding modes, and the processed first low-resolution video and the processed high-resolution video serve as a first training data set.
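A minimal sketch of how such a first training data set could be assembled, assuming ffmpeg is available on the command line; the codec choices, quality settings and file layout are illustrative assumptions (the patent only names MPEG, H.264 and H.265 as example coding modes):

```python
import subprocess
from pathlib import Path

# Illustrative stand-ins for "different compression coding modes".
CODECS = {
    "mpeg2": ["-c:v", "mpeg2video", "-q:v", "8"],
    "h264":  ["-c:v", "libx264", "-crf", "28"],
    "h265":  ["-c:v", "libx265", "-crf", "30"],
}

def build_training_pairs(hr_video: Path, out_dir: Path, scale: int = 4) -> None:
    """Bicubic-downsample a high-resolution video and re-encode the low-resolution
    copy with several codecs; each encoded copy is paired with the HR source."""
    out_dir.mkdir(parents=True, exist_ok=True)
    lr_raw = out_dir / f"{hr_video.stem}_lr.y4m"
    # Bicubic downsampling of the HR source -> first low-resolution video.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(hr_video), "-pix_fmt", "yuv420p",
         "-vf", f"scale=iw/{scale}:ih/{scale}:flags=bicubic", str(lr_raw)],
        check=True)
    # Video coding compression of the LR video with each compression coding mode.
    for name, codec_args in CODECS.items():
        lr_out = out_dir / f"{hr_video.stem}_lr_{name}.mkv"
        subprocess.run(["ffmpeg", "-y", "-i", str(lr_raw), *codec_args, str(lr_out)],
                       check=True)
```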
Step S12: and extracting a first degradation representation of each first low-resolution video in the first training data set by using the trained feature extraction network, wherein the first degradation representation is used for representing a compression coding mode of the corresponding first low-resolution video.
After the first training data set is obtained, a trained feature extraction network is further used to extract a first degradation representation of each first low-resolution video in the first training data set, where the first degradation representation is used to represent a compression coding mode of a corresponding first low-resolution video.
Namely, the first degradation representation of the first low-resolution video is extracted by using the trained feature extraction network obtained by pre-training, so that the compression coding mode of the first low-resolution video is determined.
Step S13: and inputting the first degradation characteristic and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video.
After obtaining the first degradation characterization, the first degradation characterization and the first low resolution video may be input to a pre-constructed super-resolution network to obtain a predicted high resolution video for training the super-resolution network, wherein a convolutional layer in the super-resolution network may employ depthwise separable convolution.
When the super-resolution network in the present application performs the super-resolution operation on the video frames in the first training data set, the operation is performed according to the degradation characterizations that characterize different compression coding modes, so that the noise caused by the different compression coding modes can be handled and its influence reduced. The convolution layers in the super-resolution network may adopt depthwise separable convolution; compared with a general convolution operation, a depthwise separable convolution greatly reduces the amount of computation in the convolution process, so that the super-resolution network is lightweight, computing resources are saved, the video restoration efficiency is improved, and cost is also saved.
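The computational saving of depthwise separable convolution can be made concrete with a standard operation count (a back-of-the-envelope figure, not one stated in the patent). For a K x K convolution with C_in input channels, C_out output channels and an H x W feature map:

$$
\frac{\text{separable}}{\text{standard}}
= \frac{HW\left(K^{2}C_{in} + C_{in}C_{out}\right)}{HW\,K^{2}C_{in}C_{out}}
= \frac{1}{C_{out}} + \frac{1}{K^{2}}
$$

For example, with K = 3 and C_out = 64, the separable form needs roughly 13% of the multiply-accumulate operations of a standard convolution, i.e. about an 8x reduction.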
Step S14: and determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain the trained super-resolution network.
After the super-resolution network obtains the predicted high-resolution video, a first error between the predicted high-resolution video and a high-resolution video corresponding to the first low-resolution video is determined, and the super-resolution network is updated based on the first error until the super-resolution network converges to obtain a trained super-resolution network.
Step S15: when the video to be restored is obtained, extracting a second degradation representation of the video to be restored by using the trained feature extraction network, and inputting the second degradation representation and the video to be restored into the trained super-resolution network to obtain the restored high-resolution video.
After the trained feature extraction network and the trained super-resolution network are obtained, when a video to be restored is obtained, the trained feature extraction network can be used to extract a second degradation representation of the video to be restored; the second degradation representation and the video to be restored are then input into the trained super-resolution network, and the trained super-resolution network outputs the restored high-resolution video.
Specifically, see FIG. 4. A video sequence to be restored is input into the trained feature extraction network, which outputs the degradation representations corresponding to all video frames in the video; the degradation representations and the corresponding video are then input into the trained super-resolution network, so that the trained super-resolution network outputs the restored high-resolution video.
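As a minimal PyTorch-style sketch of this inference path, assuming `feature_extractor` and `sr_network` stand for the trained feature extraction network and the trained super-resolution network; the tensor shapes and call signatures are illustrative assumptions, not taken from the patent:

```python
import torch

@torch.no_grad()
def restore_video(frames: torch.Tensor, feature_extractor, sr_network) -> torch.Tensor:
    """frames: (T, C, H, W) tensor holding the decoded low-resolution video to be restored.
    Returns the restored high-resolution frames."""
    # Second degradation representation: one vector per frame, characterizing the
    # compression coding mode of the video to be restored (assumed shape (T, D)).
    degradation = feature_extractor(frames)
    # The trained super-resolution network consumes the frames together with their
    # degradation representations and outputs the restored high-resolution video.
    return sr_network(frames, degradation)
```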
As can be seen, a first training data set is first obtained in the present application, where the first training data set includes a plurality of first low-resolution videos and corresponding high-resolution videos. And then extracting a first degradation representation of each first low-resolution video in the first training data set by using the trained feature extraction network, wherein the first degradation representation is used for representing a compression coding mode of the corresponding first low-resolution video. Inputting the first degradation characteristic and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video; and determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain the trained super-resolution network. When the video to be restored is obtained, extracting a second degradation representation of the video to be restored by using the trained feature extraction network, and inputting the second degradation representation and the video to be restored into the trained super-resolution network to obtain the restored high-resolution video. Therefore, when the super-resolution network performs super-resolution processing on the low-resolution video, the degradation representation of the compression coding mode of the low-resolution video needs to be combined for processing, so that the super-resolution network can adapt to noise caused by different compression coding modes, the video is correspondingly repaired, and the video repairing effect is improved.
Referring to fig. 5, before extracting the first degradation characterization of each low-resolution video in the training data set by using the trained feature extraction network, the method further includes:
step S21: and acquiring second training set data, wherein the second training set data comprises a plurality of second low-resolution videos and corresponding degradation characterization labels, and the degradation characterization labels represent the actual compression coding modes of the artificially marked second low-resolution videos.
That is, before the first degradation characterization of each low-resolution video in the training data set is extracted by using the trained feature extraction network, the feature extraction network needs to be trained with corresponding data. Therefore, a second training data set needs to be obtained first, where the second training data set includes a plurality of second low-resolution videos and corresponding degradation characterization labels; the degradation characterization labels represent the manually labeled actual compression coding modes of the second low-resolution videos, and the second low-resolution videos may be videos processed with different compression coding modes.
In an actual application process, the first training data set and the second training data set may be different or the same; when they are the same, the training data set includes a low-resolution video, the high-resolution video corresponding to the low-resolution video, and the degradation characterization label corresponding to the low-resolution video. The degradation characterization label is a label attached to the second low-resolution video when the second training data set is constructed, indicating its compression coding mode; for example, if the compression coding mode of the second low-resolution video is mode 1, the degradation characterization label is 1. The degradation characterization, in contrast, is a vector extracted from the low-resolution video by the feature extraction network that can represent the compression coding mode of the low-resolution video. Even if the degradation characterization labels of two low-resolution video frames are the same, the degradation characterizations extracted from the two frames may differ slightly.
Step S22: and inputting each second low-resolution video into a pre-constructed feature extraction network to obtain the prediction degradation representation of each second low-resolution video.
After the second training data set is obtained, each second low-resolution video can be input to a pre-constructed feature extraction network, so as to obtain a prediction degradation characterization of each second low-resolution video.
Specifically, inputting each second low-resolution video into a pre-constructed feature extraction network to obtain a prediction degradation characterization of each second low-resolution video, including: decoding each of the second low resolution videos in the second training data set; storing each second low-resolution video frame in each decoded second low-resolution video into a lossless picture format to obtain different second low-resolution video frame sequences, wherein one decoded second low-resolution video corresponds to one second low-resolution video frame sequence; and respectively inputting each second low-resolution video frame in each second low-resolution video frame sequence into a pre-constructed feature extraction network for training to obtain a prediction degradation representation of each second low-resolution video frame so as to obtain the prediction degradation representation of each second low-resolution video.
Since the second low-resolution video in the second training data set may be a low-resolution video processed in different compression encoding manners, it is necessary to decode each low-resolution video in the second training data set to obtain a decoded second low-resolution video.
After the decoded second low-resolution videos are obtained, each second low-resolution video frame in each decoded second low-resolution video also needs to be stored in a lossless picture format to obtain different second low-resolution video frame sequences, where one decoded second low-resolution video corresponds to one second low-resolution video frame sequence.
Because the second training data set comprises different segments of compressed and encoded second low-resolution videos, different segments of decoded second low-resolution videos can be correspondingly obtained after decoding, and the second low-resolution video frames in each segment of decoded second low-resolution videos are stored into a lossless picture format, so that different second low-resolution video frame sequences are obtained.
After each second low-resolution video frame sequence is obtained, inputting each second low-resolution video frame in each second low-resolution video frame sequence into a pre-constructed feature extraction network respectively to obtain a prediction degradation representation of each second low-resolution video frame so as to obtain the prediction degradation representation of each second low-resolution video.
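A minimal sketch of this decoding-and-storage step, assuming OpenCV is used for decoding and PNG as the lossless picture format; the file naming scheme is an illustrative assumption:

```python
import cv2
from pathlib import Path

def video_to_lossless_frames(video_path: Path, out_dir: Path) -> list[Path]:
    """Decode one second low-resolution video and store every frame as a PNG file,
    yielding the second low-resolution video frame sequence for that video."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    frame_paths, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        path = out_dir / f"{video_path.stem}_{idx:06d}.png"
        cv2.imwrite(str(path), frame)   # PNG is lossless, so no further degradation is added
        frame_paths.append(path)
        idx += 1
    cap.release()
    return frame_paths
```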
The convolution layers in the feature extraction network adopt depthwise separable convolution: point-by-point convolution is performed first when data passes through a convolution layer in the feature extraction network, and channel-by-channel convolution is performed after the point-by-point convolution is completed.
In an actual implementation process, when the convolution layers in the feature extraction network adopt depthwise separable convolution, the ordering of channel-by-channel convolution first, then point-by-point convolution, then activation can be adopted. Referring to fig. 6, point-by-point convolution followed by channel-by-channel convolution and then activation may also be used, which can further reduce the amount of computation compared with channel-by-channel convolution, then point-by-point convolution, then activation.
Compared with a general convolution operation, the depthwise separable convolution operation greatly reduces the amount of computation in the convolution process, so that the feature extraction network is lightweight, computing resources are saved, the video repair efficiency is improved, and cost is also saved.
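A PyTorch sketch of a convolution block in the pointwise-then-depthwise order described above; the channel counts, kernel size and ReLU activation are illustrative assumptions:

```python
import torch.nn as nn

class PointwiseThenDepthwise(nn.Module):
    """Depthwise separable convolution in the order described in the text:
    point-by-point (1x1) convolution first, then channel-by-channel (depthwise)
    convolution, followed by an activation."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2, groups=out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.depthwise(self.pointwise(x)))
```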
Step S23: and determining a second error based on the prediction degradation characteristics and the degradation characteristic labels, and updating the feature extraction network based on the second error until the feature extraction network converges to obtain the trained feature extraction network.
After the predicted degradation characterization is obtained, a second error between the predicted degradation characterization and a degradation characterization label needs to be determined, and the feature extraction network needs to be updated based on the second error until the feature extraction network converges, so as to obtain the trained feature extraction network.
Specifically, a corresponding second error is determined based on the predicted degradation characterization of each second low-resolution video frame and the corresponding degradation characterization tag, and the feature extraction network is updated based on the second error until the feature extraction network converges, so as to obtain the trained feature extraction network.
When the feature extraction network is a feature extraction network based on triple Loss, determining a corresponding second error based on the predicted degradation characterization of any second low-resolution video frame and the degradation characterization tag of the second low-resolution video frame, including: determining a first parameter and a second parameter according to a degradation characterization tag corresponding to the second low-resolution video frame, wherein the first parameter is a degradation characterization of a video frame in the feature extraction network, the video frame having the same compression coding mode as the second low-resolution video frame, and the second parameter represents a degradation characterization of a video frame in the feature extraction network, the video frame having a different compression coding mode from the second low-resolution video frame; and determining the triple Loss of the second low-resolution video frame based on the predictive degradation characterization of the second low-resolution video frame, the first parameter and the second parameter.
The Triplet Loss is computed as:

$$L_{tri} = \max\left(\lVert d - d_{p} \rVert_{2} - \lVert d - d_{n} \rVert_{2} + \alpha,\; 0\right)$$

where x is the input second low-resolution video frame and d is a one-dimensional vector expressing the predicted degradation characterization of x output by the feature extraction network. p denotes a video frame whose compression mode is the same as that of the currently input second low-resolution video frame, and n denotes a video frame whose compression mode is different from that of the input second low-resolution video frame. d_p is the first parameter, i.e. the degradation characterization, in the feature extraction network, of a video frame with the same compression coding mode as the second low-resolution video frame; d_n is the second parameter, i.e. the degradation characterization, in the feature extraction network, of a video frame with a compression coding mode different from that of the current video frame; and α is a constant. The aim of the Triplet Loss L_tri is to minimize the Euclidean distance between the degradation characterizations of videos with similar coding modes or similar compression strengths, and to maximize the Euclidean distance from the degradation characterizations of video data with other, different coding modes.
That is, the degradation characterizations extracted by the feature extraction network for low-resolution video frames whose degradation characterization label is the same as that of the second low-resolution video frame are determined, and these degradation characterizations are used to determine the first parameter d_p.
Then the degradation characterizations extracted by the feature extraction network for low-resolution video frames whose degradation characterization label differs from that of the second low-resolution video frame are determined, and the degradation characterizations of these frames are used to determine the second parameter d_n.
Then, the Triplet Loss of the second low-resolution video frame is determined from the predicted degradation characterization of the second low-resolution video frame, the first parameter d_p and the second parameter d_n according to the Triplet Loss formula above.
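A minimal PyTorch sketch of this Triplet Loss, assuming the degradation representations are batched one-dimensional vectors and that the margin value α = 1.0 is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def triplet_loss(d: torch.Tensor, d_p: torch.Tensor, d_n: torch.Tensor,
                 alpha: float = 1.0) -> torch.Tensor:
    """d:   predicted degradation representation of the input LR frame (anchor)
    d_p: representation of a frame with the SAME compression coding mode (positive)
    d_n: representation of a frame with a DIFFERENT compression coding mode (negative)
    alpha: constant margin."""
    pos = F.pairwise_distance(d, d_p)   # Euclidean distance to the positive
    neg = F.pairwise_distance(d, d_n)   # Euclidean distance to the negative
    return torch.clamp(pos - neg + alpha, min=0).mean()
```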
Referring to fig. 7, inputting the first degradation representation and the first low resolution video into a pre-constructed super-resolution network to obtain a predicted high resolution video includes:
step S31: dividing each first low resolution video frame sequence included in the first training data set into different first low resolution video frame subsequences by using a sliding window and a preset step size, wherein the number of video frames included in each first low resolution video frame subsequence is an odd number.
When the first degradation representation and the first training data set are used to train the pre-constructed super-resolution network, each first low-resolution video frame sequence included in the first training data set is first divided into different first low-resolution video frame subsequences by using a sliding window and a preset step length, where the number of video frames included in each first low-resolution video frame subsequence is an odd number, and the preset step length may be determined according to the actual situation and is generally 1. Since inputting a first low-resolution video frame subsequence into the super-resolution network yields the high-resolution video frame of the middle video frame of that subsequence, for a window of three frames the first subsequence of a first low-resolution video sequence can be (first frame, first frame, second frame), the second subsequence can be (first frame, second frame, third frame), and so on, and the last subsequence can be (second-to-last frame, last frame, last frame), so that every frame appears once as the middle frame.
The process of obtaining the first sequence of low resolution video frames may be: decoding each of the first low resolution videos in the first training data set; and storing each first low-resolution video frame in each decoded first low-resolution video into a lossless picture format to obtain different first low-resolution video frame sequences, wherein one decoded first low-resolution video corresponds to one first low-resolution video frame sequence.
In practical applications, first low-resolution video frames with an excessively large motion amplitude and first low-resolution video frames whose pictures are excessively blurred may be excluded from the first low-resolution video frame sequences used for training.
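A sketch of this sliding-window division, with step 1 and repetition of the first and last frames so that every frame of the sequence appears once as the centre of an odd-length window (window length 3 here, as in the example above):

```python
def split_into_windows(frames: list, window: int = 3, step: int = 1) -> list[list]:
    """Divide a frame sequence into odd-length sub-sequences whose centre frame is
    the frame to be super-resolved; the first and last frames are repeated so that
    every frame of the sequence gets its own window."""
    assert window % 2 == 1, "the number of frames per sub-sequence must be odd"
    half = window // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [padded[i:i + window] for i in range(0, len(frames), step)]
```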
Step S32: and sequentially inputting each first low-resolution video frame subsequence and the corresponding first degradation representation into a pre-constructed super-resolution network to obtain a predicted high-resolution video frame corresponding to each first low-resolution video frame.
Then, each first low-resolution video frame subsequence and the corresponding first degradation representation are sequentially input to a pre-constructed super-resolution network to obtain a predicted high-resolution video frame corresponding to each first low-resolution video frame.
Specifically, a currently input first low-resolution video frame subsequence is input to the super-resolution network; implicit feature alignment is performed on the currently input first low-resolution video frame subsequence through a feature alignment layer in the super-resolution network to obtain the corresponding aligned feature maps; the aligned feature maps are spliced to obtain a spliced feature map; the spliced feature map and a target degradation representation are input into a super-resolution layer in the super-resolution network to obtain a first target high-resolution video frame corresponding to a target low-resolution video frame, where the target low-resolution video frame is the middle video frame of the currently input first low-resolution video frame subsequence and the target degradation representation is the first degradation representation of the target low-resolution video frame; bicubic interpolation is performed on the target low-resolution video frame to obtain a second target high-resolution video frame; and the first target high-resolution video frame and the second target high-resolution video frame are added to obtain the predicted high-resolution video frame corresponding to the target low-resolution video frame.
That is, referring to fig. 8, after a first low-resolution video frame subsequence is input into the super-resolution network, implicit feature alignment is performed on the currently input subsequence by the Feature Alignment layer in the super-resolution network. For example, if the currently input first low-resolution video frame subsequence is l_{t-1}, l_t, l_{t+1}, then l_{t-1} and l_t are implicitly aligned to obtain the aligned feature map h_{t-1}; l_t and l_t are implicitly aligned to obtain the aligned feature map h_t; and l_{t+1} and l_t are implicitly aligned to obtain the aligned feature map h_{t+1}. The aligned feature maps are then spliced (Concat), i.e. superimposed along the channel dimension, to obtain a spliced feature map. The spliced feature map and the target degradation representation (for the subsequence l_{t-1}, l_t, l_{t+1} above, the degradation representation of l_t) are input into a super-resolution layer in the super-resolution network to obtain the first target high-resolution video frame corresponding to the target low-resolution video frame (for the subsequence above, l_t). Bicubic interpolation is then performed on the target low-resolution video frame (l_t) to obtain the second target high-resolution video frame. The pixel values at corresponding positions of the first target high-resolution video frame and the second target high-resolution video frame are added to obtain the predicted high-resolution video frame corresponding to the target low-resolution video frame. A first error between the high-resolution video frame corresponding to the target low-resolution video frame and the predicted high-resolution video frame is determined according to a preset formula, and the parameters of the feature alignment layer and the super-resolution layer are updated according to the first error until the super-resolution network converges, yielding the trained super-resolution network. The preset formula is:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{I}_{i} - I_{i} \right\rVert, \qquad \hat{I} = F(x;\theta)$$

where F is the defined super-resolution network, θ is the set of parameters of the super-resolution network, Î = F(x; θ) is the output of the super-resolution network, i.e. the predicted high-resolution video frame, I denotes the high-resolution video frame corresponding to the input low-resolution video frame x, i denotes a pixel position in the image, and N denotes the total number of pixels.
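A PyTorch-style skeleton of this forward pass and error computation; the `feature_align` and `super_resolution_layer` modules are placeholders for the layers of FIG. 8, the ×4 scale factor is an illustrative assumption, and the per-pixel error is implemented here as a mean absolute difference:

```python
import torch
import torch.nn.functional as F

def sr_forward(window, degradation_t, feature_align, super_resolution_layer, scale=4):
    """window: list of 3 LR frames [l_{t-1}, l_t, l_{t+1}], each of shape (B, C, H, W);
    degradation_t: first degradation representation of the centre frame l_t."""
    l_prev, l_t, l_next = window
    # Implicit feature alignment of each frame against the centre frame.
    h_prev = feature_align(l_prev, l_t)
    h_t = feature_align(l_t, l_t)
    h_next = feature_align(l_next, l_t)
    fused = torch.cat([h_prev, h_t, h_next], dim=1)          # splice on the channel axis
    residual = super_resolution_layer(fused, degradation_t)  # first target HR frame
    upsampled = F.interpolate(l_t, scale_factor=scale,
                              mode="bicubic", align_corners=False)  # second target HR frame
    return residual + upsampled                               # predicted HR frame

def first_error(pred_hr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Mean per-pixel error between the predicted and ground-truth HR frames."""
    return (pred_hr - hr).abs().mean()
```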
Inputting the spliced feature map and the target degradation representation into a super-resolution layer in the super-resolution network to obtain a first target high-resolution video frame corresponding to a target low-resolution video frame, wherein the method comprises the following steps: inputting the spliced feature map and the target degradation representation into a super-resolution layer in the super-resolution network; pooling the spliced feature map through a pooling layer in the super-resolution layer to obtain a pooling result of the spliced feature map; multiplying the pooling result by the target degradation characterization, and inputting the multiplied result to a full-connection network in the super-resolution layer to obtain a weight factor; multiplying the weight factor by the pooling result to obtain an adjusted pooling result; and carrying out deconvolution on the adjusted pooling result through a deconvolution layer in the super-resolution layer to obtain the first target high-resolution video frame.
Referring to fig. 9, after the spliced feature map and the target degradation characterization are input into a super-resolution layer in the super-resolution network, the spliced feature map is pooled by a pooling layer, where the number of pooling operations may be determined according to the actual situation (three pooling operations in the figure), so as to obtain the pooling result of the spliced feature map, which is a 1 x 1 x W vector. The pooling result is then multiplied by the target degradation characterization (also 1 x 1 x W) and input into a fully connected network in the super-resolution layer, whose number of layers may be determined according to the actual situation, so as to obtain a weight factor. Specifically, the pooling result is multiplied by the target degradation characterization and then input to the fully connected network in the super-resolution layer; the fully connected network processes the product accordingly and outputs, after activation, the weight factor (also 1 x 1 x W). The weight factor is then multiplied by the pooling result to obtain an adjusted pooling result (also 1 x 1 x W). A deconvolution layer then performs deconvolution on the adjusted pooling result to obtain the first target high-resolution video frame. The part of the figure before the deconvolution arrow is constructed based on the U-net network, and a subsequent deconvolution operation is added afterwards to ensure that the resolution of the first target high-resolution video frame is higher than that of the input image.
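A sketch of this modulation step, assuming the pooled vector and the degradation representation share the same width W. Here the weight factor is applied channel-wise to the spliced feature map before the deconvolution, which is a channel-attention reading of FIG. 9 rather than the patent's exact wiring, and the U-net body and exact layer sizes are omitted as assumptions:

```python
import torch
import torch.nn as nn

class DegradationModulatedSRLayer(nn.Module):
    """Pools the spliced feature map, multiplies the pooled vector by the target
    degradation representation, and turns the product into a weight factor with a
    small fully connected network; the factor then rescales the features before a
    deconvolution produces the first target high-resolution video frame."""
    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Sigmoid())
        self.deconv = nn.ConvTranspose2d(channels, 3, kernel_size=scale, stride=scale)

    def forward(self, fused: torch.Tensor, degradation: torch.Tensor) -> torch.Tensor:
        # fused: (B, C, H, W) spliced feature map; degradation: (B, C) target degradation
        # representation (the text's 1 x 1 x W vector, with C == W assumed).
        pooled = fused.mean(dim=(2, 3))              # pooling result, (B, C)
        weight = self.fc(pooled * degradation)       # weight factor, (B, C)
        adjusted = fused * weight[:, :, None, None]  # features rescaled by the weight factor
        return self.deconv(adjusted)                 # deconvolution -> first target HR frame
```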
The convolution layers in the super-resolution network adopt depthwise separable convolution: point-by-point convolution is performed first when data passes through a convolution layer in the super-resolution network, and channel-by-channel convolution is performed after the point-by-point convolution is completed.
Referring to fig. 10, from left to right are the results of RBPN (Recurrent Back-Projection Network), SRFlow, and the aforementioned video repair method of the present application on MPEG-compressed video. It can be seen that the image restored by the video repair method of the present application is clearer.
A first training data set may be obtained, where the first training data set includes a plurality of first low-resolution videos and corresponding high-resolution videos; extracting a first degradation representation of each first low-resolution video in the first training data set by using a trained feature extraction network, wherein the first degradation representation is used for representing a compression coding mode of the corresponding first low-resolution video; inputting the first degradation characteristic and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video; and determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain the trained super-resolution network. And then storing the trained feature extraction network and the trained super-resolution network to a video processing APP, so that when the video processing APP acquires a video to be restored, a second degradation representation of the video to be restored is extracted by using the trained feature extraction network, the second degradation representation and the video to be restored are input to the trained super-resolution network, a restored high-resolution video is obtained, and the restored high-resolution video is presented to a user.
Referring to fig. 11, an embodiment of the present application discloses a video repair apparatus, including:
the data acquisition module 11 is configured to acquire a first training data set, where the first training data set includes a plurality of first low-resolution videos and corresponding high-resolution videos;
a degradation feature extraction module 12, configured to extract, by using a trained feature extraction network, a first degradation feature of each first low-resolution video in the first training data set, where the first degradation feature is used to represent a compression encoding mode of a corresponding first low-resolution video;
the training module 13 is configured to input the first degradation feature and the first low-resolution video to a pre-constructed super-resolution network to obtain a predicted high-resolution video; determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain a trained super-resolution network;
and the video restoration module 14 is configured to, when a video to be restored is acquired, extract a second degradation representation of the video to be restored by using the trained feature extraction network, and input the second degradation representation and the video to be restored into the trained super-resolution network to obtain a restored high-resolution video.
As can be seen, a first training data set is first obtained in the present application, where the first training data set includes a plurality of first low-resolution videos and corresponding high-resolution videos. A first degradation representation of each first low-resolution video in the first training data set is then extracted using the trained feature extraction network, where the first degradation representation is used to represent the compression coding mode of the corresponding first low-resolution video. The first degradation representation and the first low-resolution video are input into a pre-constructed super-resolution network to obtain a predicted high-resolution video; a first error between the predicted high-resolution video and the high-resolution video is determined, and the super-resolution network is updated based on the first error until it converges, so as to obtain the trained super-resolution network. When a video to be restored is obtained, a second degradation representation of the video to be restored is extracted using the trained feature extraction network, and the second degradation representation and the video to be restored are input into the trained super-resolution network to obtain the restored high-resolution video. In this way, when performing super-resolution processing on a low-resolution video, the super-resolution network incorporates the degradation representation of the compression coding mode of that video, so it can adapt to the noise introduced by different compression coding modes, repair the video accordingly, and improve the video repair effect.
In a specific implementation process, the data obtaining module 11 is further configured to: acquire a second training data set, where the second training data set includes a plurality of second low-resolution videos and corresponding degradation characterization labels, and the degradation characterization labels represent the manually annotated actual compression coding modes of the second low-resolution videos;
correspondingly, the training module 13 is further configured to: input each second low-resolution video into a pre-constructed feature extraction network to obtain a prediction degradation representation of each second low-resolution video; determine a second error based on the prediction degradation representations and the degradation characterization labels; and update the feature extraction network based on the second error until the feature extraction network converges, so as to obtain the trained feature extraction network.
In a specific implementation process, the training module 13 is configured to:
decoding each of the second low resolution videos in the second training data set;
storing each second low-resolution video frame in each decoded second low-resolution video in a lossless picture format to obtain different second low-resolution video frame sequences, wherein one decoded second low-resolution video corresponds to one second low-resolution video frame sequence (a sketch of this decode-and-store step follows this list);
respectively inputting each second low-resolution video frame in each second low-resolution video frame sequence into a pre-constructed feature extraction network for training to obtain a prediction degradation representation of each second low-resolution video frame so as to obtain a prediction degradation representation of each second low-resolution video;
accordingly, determining a second error based on the predicted degradation characterization and the degradation characterization tag comprises:
determining a corresponding second error based on the predicted degradation characterization and the corresponding degradation characterization tag of each of the second low resolution video frames, respectively.
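A minimal sketch of the decode-and-store step referenced in the list above is given below, assuming ffmpeg is available and PNG is used as the lossless picture format; the tool choice and file layout are assumptions.

import subprocess
from pathlib import Path

def decode_to_lossless_frames(lr_video: Path, frames_dir: Path) -> None:
    """Decode a second low-resolution video and store every frame in a
    lossless picture format (PNG here), producing one second low-resolution
    video frame sequence per video."""
    frames_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(lr_video), str(frames_dir / "%06d.png")],
        check=True,
    )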
In a specific implementation process, the training module 13 is configured to:
when the feature extraction network is a Triplet Loss-based feature extraction network, determining a first parameter and a second parameter according to the degradation characterization tag corresponding to the second low-resolution video frame, wherein the first parameter is the degradation characterization, in the feature extraction network, of a video frame whose compression coding mode is the same as that of the second low-resolution video frame, and the second parameter is the degradation characterization, in the feature extraction network, of a video frame whose compression coding mode is different from that of the second low-resolution video frame;
and determining the Triplet Loss of the second low-resolution video frame based on the predicted degradation characterization of the second low-resolution video frame, the first parameter, and the second parameter, as sketched below.
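A minimal sketch of this Triplet Loss computation is given below; the margin value, the distance metric, and the tensor shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def triplet_loss(pred_degradation: torch.Tensor,
                 same_codec_degradation: torch.Tensor,
                 other_codec_degradation: torch.Tensor,
                 margin: float = 1.0) -> torch.Tensor:
    """Triplet Loss for one second low-resolution video frame.

    pred_degradation:        predicted degradation characterization (anchor)
    same_codec_degradation:  degradation characterization of a frame with the
                             same compression coding mode (first parameter, positive)
    other_codec_degradation: degradation characterization of a frame with a
                             different compression coding mode (second parameter, negative)
    """
    pos_dist = F.pairwise_distance(pred_degradation, same_codec_degradation)
    neg_dist = F.pairwise_distance(pred_degradation, other_codec_degradation)
    # Pull the prediction toward the same-codec characterization and push it
    # away from the different-codec characterization by at least the margin.
    return torch.clamp(pos_dist - neg_dist + margin, min=0.0).mean()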
In a specific implementation process, the convolution layers in the feature extraction network adopt depthwise separable convolution: when data passes through a convolution layer in the feature extraction network, point-by-point convolution is performed first, and channel-by-channel convolution is performed after the point-by-point convolution is completed.
In a specific implementation process, the training module 13 is configured to:
dividing each first low-resolution video frame sequence included in the first training data set into different first low-resolution video frame subsequences by using a sliding window and a preset stride, wherein the number of video frames included in each first low-resolution video frame subsequence is an odd number (a sketch of this split follows this list);
and sequentially inputting each first low-resolution video frame subsequence and the corresponding first degradation representation into a pre-constructed super-resolution network to obtain a predicted high-resolution video frame corresponding to each first low-resolution video frame.
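A minimal sketch of this sliding-window split is given below; the window length and stride are illustrative assumptions (the window length must be odd so that each subsequence has a well-defined middle frame).

from typing import List

def split_into_subsequences(frames: List, window: int = 5, stride: int = 1) -> List[List]:
    """Divide a first low-resolution video frame sequence into overlapping
    first low-resolution video frame subsequences with a sliding window and a
    preset stride; each subsequence contains an odd number of frames."""
    assert window % 2 == 1, "each subsequence must contain an odd number of frames"
    return [frames[i:i + window] for i in range(0, len(frames) - window + 1, stride)]

# Example: 9 frames, a window of 5, and a stride of 1 give 5 subsequences.
subsequences = split_into_subsequences(list(range(9)), window=5, stride=1)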
In a specific implementation process, the training module 13 is configured to:
inputting a currently input first low-resolution video frame subsequence into the super-resolution network;
performing implicit feature alignment on a currently input first low-resolution video frame subsequence through a feature alignment layer in the super-resolution network to obtain a corresponding aligned feature map;
splicing the aligned feature maps to obtain spliced feature maps;
inputting the spliced feature map and a target degradation representation into a super-resolution layer in the super-resolution network to obtain a first target high-resolution video frame corresponding to a target low-resolution video frame, wherein the target low-resolution video frame is the lowest-resolution video frame of a currently input first low-resolution video frame subsequence, and the target degradation representation is a first degradation representation of the target low-resolution video frame;
performing bicubic interpolation on the target low-resolution video frame to obtain a second target high-resolution video frame;
and adding the first target high-resolution video frame to the second target high-resolution video frame to obtain the predicted high-resolution video frame corresponding to the target low-resolution video frame, as sketched below.
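A minimal sketch of how the predicted high-resolution video frame is assembled from the two branches described above is given below; the interpolation call and the scale factor are illustrative assumptions.

import torch
import torch.nn.functional as F

def predict_hr_frame(sr_layer_output: torch.Tensor,
                     target_lr_frame: torch.Tensor,
                     scale: int = 4) -> torch.Tensor:
    """Add the super-resolution branch and the bicubic branch.

    sr_layer_output: first target high-resolution video frame, (B, 3, H*scale, W*scale)
    target_lr_frame: target low-resolution video frame, (B, 3, H, W)
    """
    # Bicubic interpolation of the target low-resolution frame yields the
    # second target high-resolution video frame.
    bicubic_hr = F.interpolate(target_lr_frame, scale_factor=scale,
                               mode="bicubic", align_corners=False)
    # The predicted high-resolution video frame is the sum of the two branches.
    return sr_layer_output + bicubic_hr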
In a specific implementation process, the training module 13 is configured to:
inputting the spliced feature map and the target degradation representation into a super-resolution layer in the super-resolution network;
pooling the spliced feature map through a pooling layer in the super-resolution layer to obtain a pooling result of the spliced feature map;
multiplying the pooling result by the target degradation characterization, and inputting the multiplied result to a full-connection network in the super-resolution layer to obtain a weight factor;
multiplying the weight factor by the pooling result to obtain an adjusted pooling result;
and carrying out deconvolution on the adjusted pooling result through a deconvolution layer in the super-resolution layer to obtain the first target high-resolution video frame.
In a specific implementation process, the data obtaining module 11 is configured to:
acquiring the high-resolution video;
performing bicubic downsampling on the high-resolution video to obtain the first low-resolution video;
and processing the first low-resolution video by using different compression coding modes, and taking the processed first low-resolution video and the processed high-resolution video as the first training data set.
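A minimal sketch of this data-preparation step is given below; the ffmpeg invocation, codec choices, and quality settings are illustrative assumptions rather than the exact compression coding modes of the embodiment.

import subprocess
from pathlib import Path

# Illustrative compression coding modes; the actual set may differ.
CODECS = {
    "h264.mp4": ["-c:v", "libx264", "-crf", "28"],
    "h265.mp4": ["-c:v", "libx265", "-crf", "30"],
    "mpeg2.mpg": ["-c:v", "mpeg2video", "-q:v", "10"],
}

def build_first_training_pairs(hr_video: Path, out_dir: Path, scale: int = 4) -> None:
    """Bicubic-downsample a high-resolution video, then re-encode the
    low-resolution copy with several compression coding modes to form the
    (first low-resolution video, high-resolution video) training pairs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    lr_raw = out_dir / f"{hr_video.stem}_lr.y4m"
    # Bicubic downsampling of the high-resolution video.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(hr_video),
         "-vf", f"scale=iw/{scale}:ih/{scale}:flags=bicubic",
         "-pix_fmt", "yuv420p", str(lr_raw)],
        check=True,
    )
    # Process the low-resolution video with different compression coding modes.
    for suffix, codec_args in CODECS.items():
        lr_out = out_dir / f"{hr_video.stem}_lr_{suffix}"
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(lr_raw), *codec_args, str(lr_out)],
            check=True,
        )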
Fig. 12 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically be, but is not limited to, a notebook computer, a desktop computer, a server, and the like.
In general, the electronic device 20 in the present embodiment includes: a processor 21 and a memory 22.
The processor 21 may include one or more processing cores, such as a four-core processor or an eight-core processor. The processor 21 may be implemented in hardware as at least one of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the images to be displayed on the display screen. In some embodiments, the processor 21 may include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 22 may include one or more computer-readable storage media, which may be non-transitory. Memory 22 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 22 is at least used for storing the following computer program 221, which, after being loaded and executed by the processor 21, implements the steps of the video repair method disclosed in any of the foregoing embodiments.
In some embodiments, the electronic device 20 may further include a display 23, an input/output interface 24, a communication interface 25, a sensor 26, a power supply 27, and a communication bus 28.
Those skilled in the art will appreciate that the configuration shown in fig. 12 does not limit the electronic device 20, which may include more or fewer components than those shown.
Further, an embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the video repair method disclosed in any of the foregoing embodiments.
For the specific process of the video repair method, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and reference may be made between the embodiments for identical or similar parts. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and reference may be made to the method part for relevant details.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The video repair method, device, and medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (11)

1. A method of video repair, comprising:
acquiring a first training data set, wherein the first training data set comprises a plurality of first low-resolution videos and corresponding high-resolution videos;
extracting a first degradation representation of each first low-resolution video in the first training data set by using a trained feature extraction network, wherein the first degradation representation is used for representing a compression coding mode of the corresponding first low-resolution video;
inputting the first degradation representation and the first low-resolution video into a pre-constructed super-resolution network to obtain a predicted high-resolution video;
determining a first error between the predicted high-resolution video and the high-resolution video, and updating the super-resolution network based on the first error until the super-resolution network converges to obtain a trained super-resolution network;
when the video to be restored is obtained, extracting a second degradation representation of the video to be restored by using the trained feature extraction network, and inputting the second degradation representation and the video to be restored into the trained super-resolution network to obtain the restored high-resolution video.
2. The method of claim 1, wherein before extracting the first degradation representation of each first low-resolution video in the first training data set using the trained feature extraction network, the method further comprises:
acquiring a second training data set, wherein the second training data set comprises a plurality of second low-resolution videos and corresponding degradation characterization labels, and the degradation characterization labels represent the manually annotated actual compression coding modes of the second low-resolution videos;
inputting each second low-resolution video into a pre-constructed feature extraction network to obtain a prediction degradation representation of each second low-resolution video;
and determining a second error based on the prediction degradation characterization and the degradation characterization label, and updating the feature extraction network based on the second error until the feature extraction network converges to obtain the trained feature extraction network.
3. The method according to claim 2, wherein said inputting each of the second low resolution videos into a pre-constructed feature extraction network to obtain a predictive degradation characterization of each of the second low resolution videos comprises:
decoding each of the second low resolution videos in the second training data set;
storing each second low-resolution video frame in each decoded second low-resolution video into a lossless picture format to obtain different second low-resolution video frame sequences, wherein one decoded second low-resolution video corresponds to one second low-resolution video frame sequence;
respectively inputting each second low-resolution video frame in each second low-resolution video frame sequence into a pre-constructed feature extraction network to obtain a prediction degradation representation of each second low-resolution video frame so as to obtain a prediction degradation representation of each second low-resolution video;
accordingly, determining a second error based on the predicted degradation characterization and the degradation characterization tag comprises:
determining a corresponding second error based on the predicted degradation characterization and the corresponding degradation characterization tag of each of the second low resolution video frames, respectively.
4. The method of claim 3, wherein determining the corresponding second error based on the predicted degradation characterization of any second low resolution video frame and the degradation characterization tag of the second low resolution video frame comprises:
when the feature extraction network is a Triplet Loss-based feature extraction network, determining a first parameter and a second parameter according to the degradation characterization tag corresponding to the second low-resolution video frame, wherein the first parameter is the degradation characterization, in the feature extraction network, of a video frame whose compression coding mode is the same as that of the second low-resolution video frame, and the second parameter is the degradation characterization, in the feature extraction network, of a video frame whose compression coding mode is different from that of the second low-resolution video frame;
and determining the Triplet Loss of the second low-resolution video frame based on the predicted degradation characterization of the second low-resolution video frame, the first parameter, and the second parameter.
5. The video repair method of claim 2, wherein the convolutional layers in the feature extraction network adopt depthwise separable convolution, wherein point-by-point convolution is performed when data passes through the convolutional layers in the feature extraction network, and channel-by-channel convolution is performed after the point-by-point convolution is completed.
6. The method of video restoration according to claim 1, wherein said inputting said first degradation characterization and said first low resolution video into a pre-constructed super resolution network resulting in a predicted high resolution video comprises:
dividing each first low-resolution video frame sequence included in the first training data set into different first low-resolution video frame subsequences by using a sliding window and a preset stride, wherein the number of video frames included in each first low-resolution video frame subsequence is an odd number;
and sequentially inputting each first low-resolution video frame subsequence and the corresponding first degradation representation into a pre-constructed super-resolution network to obtain a predicted high-resolution video frame corresponding to each first low-resolution video frame.
7. The method according to claim 6, wherein said sequentially inputting each of the first low-resolution video frame sub-sequences and the corresponding first degradation characterizations into a pre-constructed super-resolution network to obtain a predicted high-resolution video frame corresponding to each of the first low-resolution video frames comprises:
inputting a currently input first low-resolution video frame subsequence into the super-resolution network;
performing implicit feature alignment on a currently input first low-resolution video frame subsequence through a feature alignment layer in the super-resolution network to obtain a corresponding aligned feature map;
splicing the aligned feature maps to obtain spliced feature maps;
inputting the spliced feature map and a target degradation representation into a super-resolution layer in the super-resolution network to obtain a first target high-resolution video frame corresponding to a target low-resolution video frame, wherein the target low-resolution video frame is the lowest-resolution video frame of a currently input first low-resolution video frame subsequence, and the target degradation representation is a first degradation representation of the target low-resolution video frame;
performing bicubic interpolation on the target low-resolution video frame to obtain a second target high-resolution video frame;
and adding the first target high-resolution video frame and the second target high-resolution video frame to obtain a predicted high-resolution video frame corresponding to the target low-resolution video frame.
8. The method according to claim 7, wherein the inputting the stitched feature map and the target degradation characterization into a super-resolution layer in the super-resolution network to obtain a first target high-resolution video frame corresponding to a target low-resolution video frame comprises:
inputting the spliced feature map and the target degradation representation into a super-resolution layer in the super-resolution network;
pooling the spliced feature map through a pooling layer in the super-resolution layer to obtain a pooling result of the spliced feature map;
multiplying the pooling result by the target degradation characterization, and inputting the multiplied result to a full-connection network in the super-resolution layer to obtain a weight factor;
multiplying the weight factor by the pooling result to obtain an adjusted pooling result;
and carrying out deconvolution on the adjusted pooling result through a deconvolution layer in the super-resolution layer to obtain the first target high-resolution video frame.
9. The video repair method of any of claims 1 to 8, wherein the obtaining the first training data set comprises:
acquiring the high-resolution video;
performing bicubic downsampling on the high-resolution video to obtain the first low-resolution video;
and processing the first low-resolution video by using different compression coding modes, and taking the processed first low-resolution video and the processed high-resolution video as the first training data set.
10. An electronic device, comprising:
a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the video repair method according to any one of claims 1 to 9.
11. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the video repair method of any of claims 1 to 9.
CN202111039056.4A 2021-09-06 2021-09-06 Video restoration method, device and medium Pending CN113724136A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111039056.4A CN113724136A (en) 2021-09-06 2021-09-06 Video restoration method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111039056.4A CN113724136A (en) 2021-09-06 2021-09-06 Video restoration method, device and medium

Publications (1)

Publication Number Publication Date
CN113724136A true CN113724136A (en) 2021-11-30

Family

ID=78681861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111039056.4A Pending CN113724136A (en) 2021-09-06 2021-09-06 Video restoration method, device and medium

Country Status (1)

Country Link
CN (1) CN113724136A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114466243A (en) * 2021-12-21 2022-05-10 深圳大学 Social media video quality restoration processing method and system based on error modulation
CN114466243B (en) * 2021-12-21 2023-05-09 深圳大学 Social media video quality restoration processing method and system based on error modulation
CN115379117A (en) * 2022-08-22 2022-11-22 天翼数字生活科技有限公司 Method, device, terminal and medium for repairing and screening old video film source
CN115379117B (en) * 2022-08-22 2023-05-23 天翼数字生活科技有限公司 Old video film source restoration screening method, device, terminal and medium
CN115409716A (en) * 2022-11-01 2022-11-29 杭州网易智企科技有限公司 Video processing method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
Yu et al. A unified learning framework for single image super-resolution
Liu et al. Progressive image denoising through hybrid graph Laplacian regularization: A unified framework
CN113724136A (en) Video restoration method, device and medium
CN110570356B (en) Image processing method and device, electronic equipment and storage medium
RU2706891C1 (en) Method of generating a common loss function for training a convolutional neural network for converting an image into an image with drawn parts and a system for converting an image into an image with drawn parts
CN116664450A (en) Diffusion model-based image enhancement method, device, equipment and storage medium
CN113313774A (en) Image processing method, image processing device, electronic equipment and storage medium
CN110830808A (en) Video frame reconstruction method and device and terminal equipment
CN116681584A (en) Multistage diffusion image super-resolution algorithm
WO2023061116A1 (en) Training method and apparatus for image processing network, computer device, and storage medium
Chauhan et al. Deep learning-based single-image super-resolution: A comprehensive review
CN116805290A (en) Image restoration method and device
Xing et al. Scale-arbitrary invertible image downscaling
CN112637604B (en) Low-delay video compression method and device
US11887277B2 (en) Removing compression artifacts from digital images and videos utilizing generative machine-learning models
Lee et al. Wide receptive field and channel attention network for jpeg compressed image deblurring
CN115880381A (en) Image processing method, image processing apparatus, and model training method
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
WO2022178975A1 (en) Noise field-based image noise reduction method and apparatus, device, and storage medium
Hamis et al. Image compression at very low bitrate based on deep learned super-resolution
CN115375539A (en) Image resolution enhancement, multi-frame image super-resolution system and method
CN114972021A (en) Image processing method and device, electronic equipment and storage medium
CN116310660B (en) Enhanced sample generation method and device
CN117237259B (en) Compressed video quality enhancement method and device based on multi-mode fusion
CN116760983B (en) Loop filtering method and device for video coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination