CN113592719B - Training method of video super-resolution model, video processing method and corresponding equipment


Info

Publication number
CN113592719B
Authority
CN
China
Prior art keywords
image frame, current image, feature, current, network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110933607.5A
Other languages
Chinese (zh)
Other versions
CN113592719A (en)
Inventor
磯部駿
陶鑫
章佳杰
戴宇荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110933607.5A priority Critical patent/CN113592719B/en
Publication of CN113592719A publication Critical patent/CN113592719A/en
Application granted granted Critical
Publication of CN113592719B publication Critical patent/CN113592719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)
  • Television Systems (AREA)

Abstract

The disclosure provides a training method for a video super-resolution model, a video processing method, and corresponding devices. The video processing method includes: for each image frame in a video, inputting the image features of the current image frame, the image features related to the next image frame of the current image frame, and the first prediction feature of the current image frame into a forward enhancement network to obtain the first enhancement feature of the current image frame and a predicted first prediction feature of the next image frame of the current image frame; inputting the first enhancement feature of the current image frame, the image features related to the previous image frame of the current image frame, and the second prediction feature of the current image frame into a backward enhancement network to obtain the second enhancement feature of the current image frame and a predicted second prediction feature of the previous image frame of the current image frame; and obtaining a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame.

Description

Training method of video super-resolution model, video processing method and corresponding equipment
Technical Field
The present disclosure relates generally to the field of video processing technology, and more particularly, to a training method of a video super-resolution model, a video processing method and corresponding devices.
Background
With the development of video technology, video has become an integral part of people's daily life, and video super-resolution algorithms are therefore receiving extensive attention from both academia and industry. In video surveillance, a low-definition video sequence can be super-resolved so that pedestrians and license plates in the monitored scene can be magnified for inspection; in video transmission, a video can be degraded and transmitted at low resolution to reduce cost and then restored by super-resolution; in ultra-high-definition display, a video super-resolution algorithm can improve the quality of a low-quality source to enhance the user's visual experience.
Although image super-resolution algorithms have advanced rapidly with the aid of deep learning, directly applying image algorithms to video tasks still yields unsatisfactory results: ignoring temporal information leads to artifacts and inter-frame flicker in the super-resolved video.
Disclosure of Invention
Exemplary embodiments of the present disclosure provide a training method for a video super-resolution model, a video processing method, and corresponding devices, so as to at least solve the above problems in the related art, although they need not solve any particular one of those problems.
According to a first aspect of the embodiments of the present disclosure, there is provided a video processing method, including: for each image frame in a video, inputting the image features of the current image frame, the image features related to the next image frame of the current image frame, and the first prediction feature of the current image frame into a forward enhancement network to obtain the first enhancement feature of the current image frame and a predicted first prediction feature of the next image frame of the current image frame; inputting the first enhancement feature of the current image frame, the image features related to the previous image frame of the current image frame, and the second prediction feature of the current image frame into a backward enhancement network to obtain the second enhancement feature of the current image frame and a predicted second prediction feature of the previous image frame of the current image frame; and obtaining a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame; wherein the first prediction feature of the current image frame is the image feature of the current image frame predicted when the forward enhancement network was applied to the previous image frame of the current image frame, and the second prediction feature of the current image frame is the image feature of the current image frame predicted when the backward enhancement network was applied to the next image frame of the current image frame.
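To make the round-trip recurrence in the first aspect easier to follow, the sketch below shows one possible way to wire it up over a whole clip. The module names (forward_net, backward_net, upsampler), the zero initialisation of the prediction features at the sequence ends, and the two-pass scheduling are assumptions made for illustration, not the patent's concrete implementation.

    import torch

    def super_resolve_video(frames, feats, feats_prev, feats_next,
                            forward_net, backward_net, upsampler):
        # frames     : list of T low-resolution frames, each a tensor [N, C, H, W]
        # feats      : per-frame image features of the current frame
        # feats_prev : per-frame image features related to the previous frame
        # feats_next : per-frame image features related to the next frame
        T = len(frames)
        first_enh = [None] * T    # first enhancement features
        second_enh = [None] * T   # second enhancement features
        pred_fwd = [None] * T     # first prediction features (filled by the forward pass)
        pred_bwd = [None] * T     # second prediction features (filled by the backward pass)
        outputs = [None] * T

        # Forward pass: processing frame t also predicts the first prediction feature of frame t+1.
        for t in range(T):
            first_pred = pred_fwd[t] if pred_fwd[t] is not None else torch.zeros_like(feats[t])
            first_enh[t], next_pred = forward_net(feats[t], feats_next[t], first_pred)
            if t + 1 < T:
                pred_fwd[t + 1] = next_pred

        # Backward pass: processing frame t also predicts the second prediction feature of frame t-1.
        for t in reversed(range(T)):
            second_pred = pred_bwd[t] if pred_bwd[t] is not None else torch.zeros_like(first_enh[t])
            second_enh[t], prev_pred = backward_net(first_enh[t], feats_prev[t], second_pred)
            if t > 0:
                pred_bwd[t - 1] = prev_pred

        # The super-resolved frame is reconstructed from the second enhancement feature and the frame itself.
        for t in range(T):
            outputs[t] = upsampler(second_enh[t], frames[t])
        return outputs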
Optionally, the image features of the current image frame are super-resolved image features of the current image frame; the image features related to the previous image frame of the current image frame are super-resolved image features of the previous image frame, or features obtained after the inter-frame information between the current image frame and its previous image frame is enhanced; and the image features related to the next image frame of the current image frame are super-resolved image features of the next image frame, or features obtained after the inter-frame information between the current image frame and its next image frame is enhanced.
Optionally, the image features of the current image frame are super-resolved image features of the current image frame; the image features related to the previous image frame of the current image frame are super-resolved image features of the difference map between the current image frame and its previous image frame, or super-resolved image features of the optical flow map between the current image frame and its previous image frame; and the image features related to the next image frame of the current image frame are super-resolved image features of the difference map between the current image frame and its next image frame, or super-resolved image features of the optical flow map between the current image frame and its next image frame.
Optionally, the video processing method further includes: obtaining a splicing vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame; and inputting the splicing vector corresponding to the current image frame into a feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame, and the image features related to the next image frame of the current image frame.
Optionally, the step of obtaining the splicing vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame includes: splicing the current image frame, the inter-frame information between the current image frame and its previous image frame, and the inter-frame information between the current image frame and its next image frame to obtain the splicing vector corresponding to the current image frame; or splicing the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame to obtain the splicing vector corresponding to the current image frame.
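As an illustration of the two splicing variants just described, the sketch below concatenates tensors along the channel axis. Using a plain difference map as the inter-frame information is an assumption for this example only; the description later also mentions optical flow maps and homography matrices.

    import torch

    def splicing_vector(prev_frame, cur_frame, next_frame, use_interframe_info=True):
        # All frames are tensors shaped [C, H, W]; splicing is concatenation along the channel axis.
        if use_interframe_info:
            # Variant 1: current frame spliced with inter-frame information towards its two neighbours
            # (a simple difference map stands in for the inter-frame information here).
            diff_prev = cur_frame - prev_frame
            diff_next = cur_frame - next_frame
            return torch.cat([cur_frame, diff_prev, diff_next], dim=0)
        # Variant 2: direct splicing of the previous, current and next frames.
        return torch.cat([prev_frame, cur_frame, next_frame], dim=0)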
Optionally, the feature extraction network includes a fusion network, a first feature network, a second feature network, and a third feature network, and the step of inputting the splicing vector corresponding to the current image frame into the feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame, and the image features related to the next image frame of the current image frame includes: inputting the splicing vector corresponding to the current image frame into the fusion network to obtain a fusion vector corresponding to the current image frame; and inputting the fusion vector corresponding to the current image frame into the first feature network, the second feature network, and the third feature network respectively, to obtain the image features related to the next image frame of the current image frame output by the first feature network, the image features related to the previous image frame of the current image frame output by the second feature network, and the image features of the current image frame output by the third feature network.
Optionally, the first feature network, the second feature network, and the third feature network are unidirectional recurrent convolutional networks.
Optionally, the forward enhancement network includes a predict-future network and a past-to-current enhancement network, and the step of inputting the image features of the current image frame, the image features related to the next image frame of the current image frame, and the first prediction feature of the current image frame into the forward enhancement network to obtain the first enhancement feature of the current image frame and the predicted first prediction feature of the next image frame of the current image frame includes: inputting the image features of the current image frame and the image features related to the next image frame of the current image frame into the predict-future network to obtain the predicted first prediction feature of the next image frame of the current image frame; and inputting the image features of the current image frame and the first prediction feature of the current image frame into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.
Optionally, the backward enhancement network includes a predict-past network and a future-to-current enhancement network, and the step of inputting the first enhancement feature of the current image frame, the image features related to the previous image frame of the current image frame, and the second prediction feature of the current image frame into the backward enhancement network to obtain the second enhancement feature of the current image frame and the predicted second prediction feature of the previous image frame of the current image frame includes: inputting the first enhancement feature of the current image frame and the image features related to the previous image frame of the current image frame into the predict-past network to obtain the predicted second prediction feature of the previous image frame of the current image frame; and inputting the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.
Optionally, the predict-future network obtains alignment information between the image features related to the next image frame of the current image frame and the image features of the current image frame, and predicts the first prediction feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the next image frame of the current image frame.
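The claim above leaves the form of the alignment information open. One common choice in video super-resolution is a dense optical-flow field used to warp one set of features onto the other; the sketch below shows such a warp purely as an illustration (the warped features could then be spliced with the current-frame features and passed through a small convolutional stack to produce the prediction feature).

    import torch
    import torch.nn.functional as F

    def warp_by_flow(feat, flow):
        # feat: [N, C, H, W] feature map; flow: [N, 2, H, W] dense displacement field,
        # used here as one possible form of alignment information.
        n, _, h, w = feat.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                                torch.arange(w, device=feat.device), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # [1, 2, H, W] pixel grid
        coords = base + flow                                       # displaced sampling positions
        # Normalise to [-1, 1] as required by grid_sample.
        gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
        gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)                       # [N, H, W, 2]
        return F.grid_sample(feat, grid, align_corners=True)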
According to a second aspect of the embodiments of the present disclosure, there is provided a training method for a video super-resolution model, the video super-resolution model including a forward enhancement network and a backward enhancement network, wherein the training method includes: obtaining a training sample, the training sample including a training video having a plurality of image frames and a high-resolution image for each image frame; performing the following processing for each image frame in the training video: inputting the image features of the current image frame, the image features related to the next image frame of the current image frame, and the first prediction feature of the current image frame into the forward enhancement network to obtain the first enhancement feature of the current image frame and a predicted first prediction feature of the next image frame of the current image frame; inputting the first enhancement feature of the current image frame, the image features related to the previous image frame of the current image frame, and the second prediction feature of the current image frame into the backward enhancement network to obtain the second enhancement feature of the current image frame and a predicted second prediction feature of the previous image frame of the current image frame; and obtaining a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame; determining a target loss function of the video super-resolution model based on the super-resolution image of each image frame and its high-resolution image; and training the video super-resolution model by adjusting parameters of the forward enhancement network and the backward enhancement network according to the target loss function; wherein the first prediction feature of the current image frame is the image feature of the current image frame predicted when the forward enhancement network was applied to the previous image frame of the current image frame, and the second prediction feature of the current image frame is the image feature of the current image frame predicted when the backward enhancement network was applied to the next image frame of the current image frame.
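A minimal sketch of a single training step under the second aspect is given below. It assumes the whole video super-resolution model is wrapped in one module returning a super-resolved image per input frame, and uses an L1 loss between each super-resolved frame and its high-resolution image as the target loss; both choices are illustrative, since the patent does not prescribe a particular loss or optimiser.

    import torch.nn as nn

    def train_step(model, optimizer, lr_frames, hr_frames):
        # model is assumed to wrap feature extraction, forward enhancement and backward
        # enhancement, and to return one super-resolved image per input frame.
        criterion = nn.L1Loss()  # example target loss; the patent does not prescribe one
        sr_frames = model(lr_frames)
        loss = sum(criterion(sr, hr) for sr, hr in zip(sr_frames, hr_frames)) / len(hr_frames)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # adjusts the parameters of the enhancement (and feature extraction) networks
        return loss.item()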
Optionally, the image features of the current image frame are super-resolved image features of the current image frame; the image features related to the previous image frame of the current image frame are super-resolved image features of the previous image frame, or features obtained after the inter-frame information between the current image frame and its previous image frame is enhanced; and the image features related to the next image frame of the current image frame are super-resolved image features of the next image frame, or features obtained after the inter-frame information between the current image frame and its next image frame is enhanced.
Optionally, the image features of the current image frame are super-resolved image features of the current image frame; the image features related to the previous image frame of the current image frame are super-resolved image features of the difference map between the current image frame and its previous image frame, or super-resolved image features of the optical flow map between the current image frame and its previous image frame; and the image features related to the next image frame of the current image frame are super-resolved image features of the difference map between the current image frame and its next image frame, or super-resolved image features of the optical flow map between the current image frame and its next image frame.
Optionally, the video super-resolution model further includes a feature extraction network, and the training method further includes: obtaining a splicing vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame; and inputting the splicing vector corresponding to the current image frame into the feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame, and the image features related to the next image frame of the current image frame, wherein the step of training the video super-resolution model by adjusting parameters of the forward enhancement network and the backward enhancement network according to the target loss function includes: training the video super-resolution model by adjusting parameters of the feature extraction network, the forward enhancement network, and the backward enhancement network according to the target loss function.
Optionally, the step of obtaining the splicing vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame includes: splicing the current image frame, the inter-frame information between the current image frame and its previous image frame, and the inter-frame information between the current image frame and its next image frame to obtain the splicing vector corresponding to the current image frame; or splicing the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame to obtain the splicing vector corresponding to the current image frame.
Optionally, the feature extraction network includes a fusion network, a first feature network, a second feature network, and a third feature network, and the step of inputting the splicing vector corresponding to the current image frame into the feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame, and the image features related to the next image frame of the current image frame includes: inputting the splicing vector corresponding to the current image frame into the fusion network to obtain a fusion vector corresponding to the current image frame; and inputting the fusion vector corresponding to the current image frame into the first feature network, the second feature network, and the third feature network respectively, to obtain the image features related to the next image frame of the current image frame output by the first feature network, the image features related to the previous image frame of the current image frame output by the second feature network, and the image features of the current image frame output by the third feature network.
Optionally, the first feature network, the second feature network, and the third feature network are unidirectional recurrent convolutional networks.
Optionally, the forward enhancement network includes a predict-future network and a past-to-current enhancement network, and the step of inputting the image features of the current image frame, the image features related to the next image frame of the current image frame, and the first prediction feature of the current image frame into the forward enhancement network to obtain the first enhancement feature of the current image frame and the predicted first prediction feature of the next image frame of the current image frame includes: inputting the image features of the current image frame and the image features related to the next image frame of the current image frame into the predict-future network to obtain the predicted first prediction feature of the next image frame of the current image frame; and inputting the image features of the current image frame and the first prediction feature of the current image frame into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.
Optionally, the backward enhancement network includes a predict-past network and a future-to-current enhancement network, and the step of inputting the first enhancement feature of the current image frame, the image features related to the previous image frame of the current image frame, and the second prediction feature of the current image frame into the backward enhancement network to obtain the second enhancement feature of the current image frame and the predicted second prediction feature of the previous image frame of the current image frame includes: inputting the first enhancement feature of the current image frame and the image features related to the previous image frame of the current image frame into the predict-past network to obtain the predicted second prediction feature of the previous image frame of the current image frame; and inputting the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.
Optionally, the predict-future network obtains alignment information between the image features related to the next image frame of the current image frame and the image features of the current image frame, and predicts the first prediction feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the next image frame of the current image frame.
According to a third aspect of the embodiments of the present disclosure, there is provided a video processing apparatus including: a forward enhancement unit configured to input, for each image frame in the video, an image feature of a current image frame, an image feature related to a next image frame of the current image frame, and a first prediction feature of the current image frame into a forward enhancement network, resulting in a first enhancement feature of the current image frame and a first prediction feature of a predicted next image frame of the current image frame; a backward enhancement unit configured to input a first enhancement feature of the current image frame, an image feature related to a previous image frame of the current image frame, and a second prediction feature of the current image frame into a backward enhancement network to obtain a second enhancement feature of the current image frame and a second prediction feature of a previous image frame of the predicted current image frame; the super-resolution image acquisition unit is configured to acquire a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame; wherein the first predicted feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for a previous image frame of the current image frame; the second predicted feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for a next image frame to the current image frame.
Optionally, the image characteristics of the current image frame are: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame are: super-resolved image features of the previous image frame of the current image frame or features of the current image frame and the previous image frame after the inter-frame information between the current image frame and the previous image frame is enhanced; the image features associated with the next image frame to the current image frame are: super-resolved image features of the next image frame of the current image frame or features of the current image frame and the next image frame after the inter-frame information between the current image frame and the next image frame is enhanced.
Optionally, the image characteristics of the current image frame are: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame are: super-resolved image features of a difference image between the current image frame and the previous image frame or super-resolved image features of an optical flow image between the current image frame and the previous image frame; the image features associated with the next image frame to the current image frame are: super-resolved image features of a difference map between a current image frame and a next image frame or super-resolved image features of an optical flow map between the current image frame and the next image frame.
Optionally, the video processing device further includes: the splicing vector acquisition unit is configured to obtain a splicing vector corresponding to the current image frame based on the current image frame, the image frame previous to the current image frame and the image frame next to the current image frame; the feature extraction unit is configured to input the spliced vector corresponding to the current image frame into the feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame and the image features related to the next image frame of the current image frame.
Optionally, the splicing vector acquisition unit is configured to splice the current image frame, the inter-frame information between the current image frame and its previous image frame, and the inter-frame information between the current image frame and its next image frame to obtain the splicing vector corresponding to the current image frame; or the splicing vector acquisition unit is configured to splice the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame to obtain the splicing vector corresponding to the current image frame.
Optionally, the feature extraction network includes a fusion network, a first feature network, a second feature network, and a third feature network, wherein the feature extraction unit is configured to input the splicing vector corresponding to the current image frame into the fusion network to obtain the fusion vector corresponding to the current image frame, and to input the fusion vector corresponding to the current image frame into the first feature network, the second feature network, and the third feature network respectively, to obtain the image features related to the next image frame of the current image frame output by the first feature network, the image features related to the previous image frame of the current image frame output by the second feature network, and the image features of the current image frame output by the third feature network.
Optionally, the first feature network, the second feature network, and the third feature network are unidirectional recurrent convolutional networks.
Optionally, the forward enhancement network includes a predict-future network and a past-to-current enhancement network, wherein the forward enhancement unit is configured to input the image features of the current image frame and the image features related to the next image frame of the current image frame into the predict-future network to obtain the predicted first prediction feature of the next image frame of the current image frame, and to input the image features of the current image frame and the first prediction feature of the current image frame into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.
Optionally, the backward enhancement network includes a predict-past network and a future-to-current enhancement network, wherein the backward enhancement unit is configured to input the first enhancement feature of the current image frame and the image features related to the previous image frame of the current image frame into the predict-past network to obtain the predicted second prediction feature of the previous image frame of the current image frame, and to input the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.
Optionally, the predict-future network obtains alignment information between the image features related to the next image frame of the current image frame and the image features of the current image frame, and predicts the first prediction feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the next image frame of the current image frame.
According to a fourth aspect of embodiments of the present disclosure, there is provided a training apparatus for a video super-resolution model, the video super-resolution model including: a forward enhancement network and a backward enhancement network, wherein the training device comprises: a training sample acquisition unit configured to acquire a training sample, wherein the training sample includes: a training video having a plurality of image frames, a high resolution image for each image frame; a forward enhancement unit configured to input, for each image frame in the training video, an image feature of a current image frame, an image feature related to a next image frame of the current image frame, and a first prediction feature of the current image frame into the forward enhancement network, resulting in a first enhancement feature of the current image frame and a first prediction feature of a predicted next image frame of the current image frame; a backward enhancement unit configured to input a first enhancement feature of the current image frame, an image feature related to a previous image frame of the current image frame, and a second prediction feature of the current image frame into the backward enhancement network, to obtain a second enhancement feature of the current image frame and a second prediction feature of a previous image frame of the predicted current image frame; the super-resolution image acquisition unit is configured to acquire a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame; a loss function determining unit configured to determine an objective loss function of the video super-resolution model based on the super-resolution image of each image frame and the high-resolution image thereof; a training unit configured to train the video super-resolution model by adjusting parameters of the forward enhancement network and the backward enhancement network according to the target loss function; wherein the first predicted feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for a previous image frame of the current image frame; the second predicted feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for a next image frame to the current image frame.
Optionally, the image characteristics of the current image frame are: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame are: super-resolved image features of the previous image frame of the current image frame or features of the current image frame and the previous image frame after the inter-frame information between the current image frame and the previous image frame is enhanced; the image features associated with the next image frame to the current image frame are: super-resolved image features of the next image frame of the current image frame or features of the current image frame and the next image frame after the inter-frame information between the current image frame and the next image frame is enhanced.
Optionally, the image characteristics of the current image frame are: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame are: super-resolved image features of a difference image between the current image frame and the previous image frame or super-resolved image features of an optical flow image between the current image frame and the previous image frame; the image features associated with the next image frame to the current image frame are: super-resolved image features of a difference map between a current image frame and a next image frame or super-resolved image features of an optical flow map between the current image frame and the next image frame.
Optionally, the video super-resolution model further includes: a feature extraction network, wherein the training device further comprises: the splicing vector acquisition unit is configured to obtain a splicing vector corresponding to the current image frame based on the current image frame, the image frame previous to the current image frame and the image frame next to the current image frame; the feature extraction unit is configured to input a splicing vector corresponding to the current image frame into the feature extraction network to obtain an image feature of the current image frame, an image feature related to a previous image frame of the current image frame and an image feature related to a next image frame of the current image frame, wherein the training unit is configured to train the video super-resolution model by adjusting parameters of the feature extraction network, the forward enhancement network and the backward enhancement network according to the target loss function.
Optionally, the splicing vector acquisition unit is configured to splice the current image frame, the inter-frame information between the current image frame and its previous image frame, and the inter-frame information between the current image frame and its next image frame to obtain the splicing vector corresponding to the current image frame; or the splicing vector acquisition unit is configured to splice the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame to obtain the splicing vector corresponding to the current image frame.
Optionally, the feature extraction network includes a fusion network, a first feature network, a second feature network, and a third feature network, wherein the feature extraction unit is configured to input the splicing vector corresponding to the current image frame into the fusion network to obtain the fusion vector corresponding to the current image frame, and to input the fusion vector corresponding to the current image frame into the first feature network, the second feature network, and the third feature network respectively, to obtain the image features related to the next image frame of the current image frame output by the first feature network, the image features related to the previous image frame of the current image frame output by the second feature network, and the image features of the current image frame output by the third feature network.
Optionally, the first feature network, the second feature network, and the third feature network are unidirectional recurrent convolutional networks.
Optionally, the forward enhancement network includes a predict-future network and a past-to-current enhancement network, wherein the forward enhancement unit is configured to input the image features of the current image frame and the image features related to the next image frame of the current image frame into the predict-future network to obtain the predicted first prediction feature of the next image frame of the current image frame, and to input the image features of the current image frame and the first prediction feature of the current image frame into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.
Optionally, the backward enhancement network includes a predict-past network and a future-to-current enhancement network, wherein the backward enhancement unit is configured to input the first enhancement feature of the current image frame and the image features related to the previous image frame of the current image frame into the predict-past network to obtain the predicted second prediction feature of the previous image frame of the current image frame, and to input the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.
Optionally, the predict-future network obtains alignment information between the image features related to the next image frame of the current image frame and the image features of the current image frame, and predicts the first prediction feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the next image frame of the current image frame.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a video processing method as described above and/or a training method of a video super-resolution model as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by at least one processor, cause the at least one processor to perform a video processing method as described above and/or a training method of a video super-resolution model as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions, characterized in that the computer instructions, when executed by at least one processor, implement a video processing method as described above and/or a training method of a video super-resolution model as described above.
The training method for a video super-resolution model, the video processing method, and the corresponding devices disclosed in the embodiments of the present disclosure can improve the effect of video super-resolution.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
The past and future results of the recurrent convolutional network are transformed along the time axis to the current frame and serve as supplementary information for the current frame, so that the information of the current frame is further enhanced. In other words, a temporal round-trip optimization strategy is provided: by transforming the states of the results of a unidirectional recurrent convolutional network, past and future results are brought to the current frame to strengthen the current result, and the super-resolution result is optimized hierarchically, which alleviates the ill-posed nature of the super-resolution problem.
An ingenious way of performing the temporal transformation from different moments to the current frame is provided: the properties of the temporal transformation are exploited, and the state transformation is carried out through optical flow or frame differences.
Low-resolution temporally transformed images are taken as input and super-resolved, so that the neural network applies different degrees of super-resolution to temporally redundant and non-redundant regions. That is, with the temporally transformed images as input to the super-resolution network, the network adaptively captures the temporally redundant and non-redundant regions of adjacent frames, which alleviates the influence of the locally shared weights of convolutional neural networks on video feature extraction.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 illustrates a flowchart of a training method of a video super-resolution model according to an exemplary embodiment of the present disclosure;
Figs. 2 and 3 illustrate examples of video super-resolution models according to exemplary embodiments of the present disclosure;
Figs. 4 and 5 illustrate examples of the super-resolution effect of a video super-resolution model according to an exemplary embodiment of the present disclosure;
Fig. 6 illustrates a flowchart of a video processing method according to an exemplary embodiment of the present disclosure;
Fig. 7 illustrates a block diagram of a training apparatus for a video super-resolution model according to an exemplary embodiment of the present disclosure;
Fig. 8 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure;
Fig. 9 illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: any one of the items, a combination of any of the items, and all of the items. For example, "including at least one of A and B" covers three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step one and step two" covers three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Fig. 1 illustrates a flowchart of a training method of a video super-resolution model according to an exemplary embodiment of the present disclosure. The video super-resolution model comprises: a forward enhancement network and a backward enhancement network.
Referring to fig. 1, in step S101, a training sample is acquired.
Here, the training sample may include: a training video having a plurality of image frames, and a high resolution image for each image frame. Each image frame itself has a lower resolution than the resolution of the high resolution image of that image frame.
For each image frame in the training video, steps S102-S104 are performed; in other words, steps S102-S104 are performed with each image frame in turn taken as the current image frame. It should be appreciated that steps S102-S104 may be performed in parallel for different image frames. For example, while steps S102-S104 are being performed for the i-th frame, steps S102-S104 may also be performed for the (i-1)-th frame and for the (i+1)-th frame.
In step S102, the image features of the current image frame, the image features related to the next image frame of the current image frame, and the first prediction feature of the current image frame are input into the forward enhancement network to obtain the first enhancement feature of the current image frame and a predicted first prediction feature of the next image frame of the current image frame. The first enhancement feature of the current image frame is the image feature obtained after the image features of the current image frame are enhanced by the forward enhancement network.
In step S103, the first enhancement feature of the current image frame, the image features related to the previous image frame of the current image frame, and the second prediction feature of the current image frame are input into the backward enhancement network to obtain the second enhancement feature of the current image frame and a predicted second prediction feature of the previous image frame of the current image frame. The second enhancement feature of the current image frame is the image feature obtained after the first enhancement feature of the current image frame is enhanced by the backward enhancement network.
Here, the first prediction feature of the current image frame is the image feature of the current image frame predicted when the forward enhancement network was applied to the previous image frame of the current image frame, that is, the first prediction feature of the current image frame predicted when step S102 was performed for the previous image frame of the current image frame.
The second prediction feature of the current image frame is the image feature of the current image frame predicted when the backward enhancement network was applied to the next image frame of the current image frame, that is, the second prediction feature of the current image frame predicted when step S103 was performed for the next image frame of the current image frame.
Accordingly, the first prediction feature of the next image frame of the current image frame, predicted when step S102 is performed for the current image frame, is input into the forward enhancement network when step S102 is performed for the next image frame of the current image frame; and the second prediction feature of the previous image frame of the current image frame, predicted when step S103 is performed for the current image frame, is input into the backward enhancement network when step S103 is performed for the previous image frame of the current image frame.
As an example, the video super-resolution model may further include a feature extraction network. In this case, the training method of the video super-resolution model according to the exemplary embodiments of the present disclosure may further include: after step S101 and before step S102, obtaining a splicing vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame (i.e., the previous frame), and the next image frame of the current image frame (i.e., the next frame); and inputting the splicing vector corresponding to the current image frame into the feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame, and the image features related to the next image frame of the current image frame.
As an example, the current image frame, the inter-frame information between the current image frame and its previous image frame, and the inter-frame information between the current image frame and its next image frame may be spliced to obtain the splicing vector corresponding to the current image frame. In addition, the current image frame, the inter-frame information between the current image frame and its previous image frame, the inter-frame information between the current image frame and its next image frame, and the fusion vector corresponding to the previous image frame of the current image frame (the manner of obtaining the fusion vector is described in detail below) may be spliced to obtain the splicing vector corresponding to the current image frame.
As an example, the inter-frame information may include, but is not limited to, at least one of inter-frame complementary information, inter-frame difference information, and inter-frame alignment information. For example, the inter-frame information between the current image frame and its previous image frame may include, but is not limited to, at least one of a difference map, an optical flow map, and a homography matrix between the current image frame and its previous image frame; the inter-frame information between the current image frame and its next image frame may include, but is not limited to, at least one of a difference map, an optical flow map, and a homography matrix between the current image frame and its next image frame.
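For illustration, the snippet below computes two of the inter-frame quantities named above for a pair of neighbouring frames, using OpenCV's absolute difference and Farneback optical flow; these concrete operators are assumptions for the example, and a homography matrix or another alignment estimate could be used instead.

    import cv2

    def interframe_info(cur_gray, other_gray):
        # Both inputs are single-channel uint8 images of the same size.
        diff_map = cv2.absdiff(cur_gray, other_gray)                      # inter-frame difference map
        flow = cv2.calcOpticalFlowFarneback(cur_gray, other_gray, None,   # dense optical flow map
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        return diff_map, flow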
As another example, the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame may be spliced to obtain the splicing vector corresponding to the current image frame. In addition, the current image frame, the previous image frame of the current image frame, the next image frame of the current image frame, and the fusion vector corresponding to the previous image frame of the current image frame may be spliced to obtain the splicing vector corresponding to the current image frame.
As one example, the image features of the current image frame may be the original image features of the current image frame; the image features related to the previous image frame of the current image frame may be the original image features of the previous image frame, or features of the inter-frame information between the current image frame and its previous image frame; and the image features related to the next image frame of the current image frame may be the original image features of the next image frame, or features of the inter-frame information between the current image frame and its next image frame.
As another example, the image features of the current image frame may be super-resolved image features of the current image frame; the image features related to the previous image frame of the current image frame may be super-resolved image features of the previous image frame, or features obtained after the inter-frame information between the current image frame and its previous image frame is enhanced; and the image features related to the next image frame of the current image frame may be super-resolved image features of the next image frame, or features obtained after the inter-frame information between the current image frame and its next image frame is enhanced.
For example, the image features of the current image frame may be super-resolved image features of the current image frame; the image features related to the previous image frame of the current image frame may be super-resolved image features of the difference map between the current image frame and its previous image frame, or super-resolved image features of the optical flow map between them; and the image features related to the next image frame of the current image frame may be super-resolved image features of the difference map between the current image frame and its next image frame, or super-resolved image features of the optical flow map between them.
As an example, the feature extraction network may include a fusion network, a first feature network, a second feature network, and a third feature network.
As an example, the stitching vector corresponding to the current image frame may be input into the fusion network to obtain the fusion vector corresponding to the current image frame; then, the fusion vector corresponding to the current image frame may be input into the first feature network, the second feature network, and the third feature network, respectively, to obtain the image feature related to the next image frame of the current image frame output by the first feature network, the image feature related to the previous image frame of the current image frame output by the second feature network, and the image feature of the current image frame output by the third feature network.
As an example, the first feature network, the second feature network, and the third feature network may be super-resolution networks. As an example, the first, second, and third feature networks may be unidirectional circular convolution networks.
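A minimal sketch of such a feature extraction network, following the topology described above (one fusion network feeding three feature networks); the layer choices, channel sizes, and the mapping of heads to outputs are illustrative assumptions rather than the patented implementation:

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, n=2):
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class FeatureExtraction(nn.Module):
    def __init__(self, stitched_channels=25, feat_channels=64):
        super().__init__()
        self.fusion = conv_block(stitched_channels, feat_channels)    # fusion network
        self.head_future = conv_block(feat_channels, feat_channels)   # first feature network
        self.head_past = conv_block(feat_channels, feat_channels)     # second feature network
        self.head_current = conv_block(feat_channels, feat_channels)  # third feature network

    def forward(self, stitched):
        h = self.fusion(stitched)  # fusion vector of the current image frame
        return h, self.head_future(h), self.head_past(h), self.head_current(h)
```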
As an example, the forward enhancement network may include: a prediction future network and a past-to-current enhancement network. As an example, the image features of the current image frame and the image features related to the next image frame of the current image frame may be input into the prediction future network to obtain the first prediction feature of the next image frame of the current image frame; and the image features of the current image frame and the first prediction feature of the current image frame may be input into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.

For example, the prediction future network may stitch together the image features of the current image frame and the image features related to the next image frame of the current image frame and then pass the result through a convolutional network to obtain the first prediction feature of the next image frame of the current image frame.

For example, the past-to-current enhancement network may stitch together the image features of the current image frame and the first prediction feature of the current image frame and then pass the result through a convolutional network to obtain the first enhancement feature of the current image frame.
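The following sketch mirrors the forward enhancement structure just described (concatenate, then a small convolutional network); layer depths and channel sizes are assumed for illustration:

```python
import torch
import torch.nn as nn

class ForwardEnhancement(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.predict_future = nn.Sequential(   # prediction future network
            nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))
        self.past_to_current = nn.Sequential(  # past-to-current enhancement network
            nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))

    def forward(self, feat_curr, feat_next_related, pred_curr_first):
        # First prediction feature of the next image frame.
        pred_next = self.predict_future(torch.cat([feat_curr, feat_next_related], dim=1))
        # First enhancement feature of the current image frame, using the prediction
        # made for it when the previous image frame was processed.
        enh_first = self.past_to_current(torch.cat([feat_curr, pred_curr_first], dim=1))
        return enh_first, pred_next
```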
As an example, the backward enhancement network may include: a prediction past network and a future-to-current enhancement network. As an example, the first enhancement feature of the current image frame and the image features related to the previous image frame of the current image frame may be input into the prediction past network to obtain the second prediction feature of the previous image frame of the current image frame; and the first enhancement feature of the current image frame and the second prediction feature of the current image frame may be input into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.

For example, the prediction past network may stitch together the first enhancement feature of the current image frame and the image features related to the previous image frame of the current image frame and then pass the result through a convolutional network to obtain the second prediction feature of the previous image frame of the current image frame.

For example, the future-to-current enhancement network may stitch together the first enhancement feature of the current image frame and the second prediction feature of the current image frame and then pass the result through a convolutional network to obtain the second enhancement feature of the current image frame.
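The backward enhancement network mirrors the forward one; the sketch below uses the same assumed layer choices as the forward sketch above:

```python
import torch
import torch.nn as nn

class BackwardEnhancement(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.predict_past = nn.Sequential(       # prediction past network
            nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))
        self.future_to_current = nn.Sequential(  # future-to-current enhancement network
            nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))

    def forward(self, enh_first, feat_prev_related, pred_curr_second):
        # Second prediction feature of the previous image frame.
        pred_prev = self.predict_past(torch.cat([enh_first, feat_prev_related], dim=1))
        # Second enhancement feature of the current image frame, using the prediction
        # made for it when the next image frame was processed.
        enh_second = self.future_to_current(torch.cat([enh_first, pred_curr_second], dim=1))
        return enh_second, pred_prev
```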
As an example, the prediction future network may obtain alignment information between the image features related to the next image frame of the current image frame and the image features of the current image frame, and predict the first prediction feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the next image frame of the current image frame. As an example, the prediction past network may obtain alignment information between the image features related to the previous image frame of the current image frame and the image features of the current image frame, and predict the second prediction feature of the previous image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the previous image frame of the current image frame. For example, the alignment information may be patch-based alignment information.
In another embodiment, the image features of the current image frame, the image features related to the previous image frame of the current image frame, and the second prediction feature of the current image frame may be input into the backward enhancement network to obtain the first enhancement feature of the current image frame and the second prediction feature of the previous image frame of the current image frame; then, the first enhancement feature of the current image frame, the image features related to the next image frame of the current image frame, and the first prediction feature of the current image frame may be input into the forward enhancement network to obtain the second enhancement feature of the current image frame and the first prediction feature of the next image frame of the current image frame. That is, the relative positions of the backward enhancement network and the forward enhancement network are exchanged; it should be understood that this solution also falls within the scope of the present disclosure.
In step S104, a super-resolution image of the current image frame is obtained based on the second enhancement feature of the current image frame and the current image frame. It should be appreciated that the resolution of the super-resolution image of the current image frame is higher than the original resolution of the current image frame.
In one embodiment, when the image features of the current image frame are original image features of the current image frame, the video super-resolution model further comprises: the super resolution network, wherein step S104 may include: inputting the second enhancement characteristic of the current image frame into the super-resolution network to obtain a high-resolution detailed image of the current image frame; up-sampling the current image frame to obtain a high-resolution structural image of the current image frame; then, the high-resolution detailed image of the current image frame and the high-resolution structural image of the current image frame are subjected to superposition processing to obtain a super-resolution image of the current image frame.
In another embodiment, when the image features of the current image frame are the super-resolved image features of the current image frame, the current image frame may be up-sampled to obtain a high-resolution structural image of the current image frame; and performing superposition processing on the second enhancement feature of the current image frame and the high-resolution structural image of the current image frame to obtain a super-resolution image of the current image frame.
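A minimal sketch of assembling the super-resolution image along the lines of the two embodiments above: a detail image derived from the second enhancement feature is added element-wise to an upsampled structural image of the current frame. The 4x scale, the pixel-shuffle decoder, and the bicubic upsampling are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reconstruction(nn.Module):
    def __init__(self, c=64, scale=4):
        super().__init__()
        self.scale = scale
        self.to_detail = nn.Sequential(               # decodes the enhancement feature
            nn.Conv2d(c, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))                   # into a high-resolution detail image

    def forward(self, enh_second, lr_frame):
        detail = self.to_detail(enh_second)
        structure = F.interpolate(lr_frame, scale_factor=self.scale,
                                  mode='bicubic', align_corners=False)
        return detail + structure                     # element-wise superposition
```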
Specifically, when the extracted image features of the current image frame are the super-resolved image features of the current image frame, the time sequence transformation is performed on the super-resolved features, thereby realizing the time sequence round trip; when the extracted image features of the current image frame are the original image features of the current image frame, the time sequence transformation is performed at the original image feature level, and the super-resolution processing is performed subsequently.
In step S105, an objective loss function of the video super-resolution model is determined based on the super-resolution image of each image frame and its high-resolution image.
In step S106, the video super-resolution model is trained by adjusting parameters of the forward enhancement network and the backward enhancement network according to the objective loss function.
As an example, when the video super-resolution model further includes the feature extraction network, the video super-resolution model is trained by adjusting parameters of the feature extraction network, the forward enhancement network, and the backward enhancement network according to the target loss function.
As an example, when the video super-resolution model further includes the super-resolution network, the video super-resolution model is trained by adjusting parameters of the feature extraction network, the forward enhancement network, the backward enhancement network, and the super-resolution network according to the target loss function.
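A minimal sketch of one training step under these assumptions: the target loss is taken to be an L1 loss averaged over all frames, and the optimizer covers the forward enhancement network, the backward enhancement network, and (when present) the feature extraction and super-resolution networks; the loss choice, optimizer, and the `model` interface are illustrative only:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, lr_frames, hr_frames):
    # lr_frames / hr_frames: lists of (N, 3, H, W) and (N, 3, sH, sW) tensors.
    sr_frames = model(lr_frames)                      # super-resolution image per frame
    loss = sum(F.l1_loss(sr, hr) for sr, hr in zip(sr_frames, hr_frames)) / len(sr_frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # adjusts the networks' parameters
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # all trainable sub-networks
```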
Furthermore, it should be appreciated that the video super-resolution model may be trained using a plurality of training samples.
Fig. 2 and 3 illustrate examples of video super-resolution models according to exemplary embodiments of the present disclosure.
As shown in fig. 2, the time sequence transformation here employs difference maps. That is, the current image frame I_t, the difference map D_{t-1,t} between the current image frame and its previous image frame, the difference map D_{t,t+1} between the current image frame and its next image frame, and the fusion vector h_{t-1} corresponding to the previous image frame of the current image frame are spliced to obtain the stitching vector corresponding to the current image frame. The stitching vector corresponding to the current image frame is then input into the fusion network Aggregation to obtain the fusion vector h_t corresponding to the current image frame, and h_t is input into the first feature network Future-Residual Head (for example, a super-resolution network based on a unidirectional cyclic convolution network), the second feature network Spatial-Residual Head (for example, a super-resolution network based on a unidirectional cyclic convolution network), and the third feature network Past-Residual Head (for example, a super-resolution network based on a unidirectional cyclic convolution network), respectively, to obtain the super-resolved feature F_t of the difference map D_{t,t+1} output by the first feature network (i.e., enhanced image features of that difference map), the super-resolved image feature S_t of the current image frame output by the second feature network, and the super-resolved feature P_t of the difference map D_{t-1,t} output by the third feature network. By enhancing the difference maps, the neural network is helped to super-resolve redundant and non-redundant areas to different degrees, which overcomes the limitation of locally shared convolution over different temporal areas.
Then, F or P is added to S to perform the time sequence transformation: S is transformed towards the super-resolution result of the future frame by F, and towards the super-resolution result of the past frame by P. Thus, for the future moment there are both the future frame's own result and the result transformed from the current frame to the future; for the past moment there are both the past frame's own result and the result transformed from the current frame to the past. This process is the time sequence round-trip optimization. Specifically, S_t, F_t, and the first prediction feature of the current image frame are input into the forward enhancement network Forward Refinement to obtain the first prediction feature of the next image frame of the current image frame output by the forward enhancement network and the first enhancement feature S_t' of the current image frame; then, S_t', P_t, and the second prediction feature of the current image frame are input into the backward enhancement network Backward Refinement to obtain the second prediction feature of the previous image frame of the current image frame output by the backward enhancement network and the second enhancement feature S_t'' of the current image frame. Then, the current image frame I_t is subjected to up-sampling processing (e.g., Space to depth) to obtain the high-resolution structural image of the current image frame, and S_t'' and the high-resolution structural image of the current image frame are subjected to superposition processing (e.g., element-wise Addition) to obtain the super-resolution image of the current image frame.
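A minimal sketch (tensor names assumed) of the time sequence transformation in fig. 2, where the super-resolved current feature is shifted towards the future or the past by adding the corresponding super-resolved difference-map feature:

```python
import torch

S_t = torch.rand(1, 64, 64, 64)  # super-resolved feature of the current frame
F_t = torch.rand(1, 64, 64, 64)  # super-resolved feature of the current-to-next difference map
P_t = torch.rand(1, 64, 64, 64)  # super-resolved feature of the previous-to-current difference map

future_view = S_t + F_t          # approximates the next frame's super-resolution result
past_view = S_t + P_t            # approximates the previous frame's super-resolution result
```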
As shown in fig. 3, a small white circle represents a low-resolution image frame of the video, a large white circle represents a super-resolved image frame, a box represents the feature extraction network of the video super-resolution model, and a small gray circle represents an image used for the time sequence transformation (e.g., a difference map between two adjacent image frames). The embodiment of the disclosure provides a time sequence round-trip optimization strategy that optimizes a unidirectional cyclic convolution network and solves the problem of unbalanced information distribution in the unidirectional cyclic convolution network. Compared with a bidirectional cyclic convolution network, the computational overhead is smaller, since the original video only needs to be super-resolved once. In order to realize the time sequence round-trip optimization more efficiently, a time sequence transformation means is adopted, in which the past and future super-resolution results are transformed to the current moment to supplement information at lower cost, further alleviating the ill-posed nature of the super-resolution problem. Fig. 4 and 5 illustrate examples of super-resolution effects of a video super-resolution model according to an exemplary embodiment of the present disclosure.
Table 1 shows the performance of the video super-resolution model of an exemplary embodiment of the present disclosure. As an example, performance can be verified on Vid4 and UDM10, respectively, with n=1, i.e., one future state and one past state. The baseline, which does not employ the video super-resolution model of the present disclosure, achieves PSNR values of 27.81 dB and 39.23 dB on Vid4 and UDM10, respectively. When the video super-resolution model of the present disclosure is adopted with optical flow as the time sequence transformation (Optical Flow), PSNR values of 28.03 dB and 39.51 dB are achieved; when difference maps are adopted as the time sequence transformation (Temporal Residual), PSNR values of 28.12 dB and 39.65 dB are achieved. The improvement in performance is evident.
Table 1 Performance comparison (PSNR, dB)

Time sequence transformation    Vid4     UDM10
Baseline (none)                 27.81    39.23
Optical Flow                    28.03    39.51
Temporal Residual               28.12    39.65
Fig. 6 shows a flowchart of a video processing method according to an exemplary embodiment of the present disclosure.
Referring to fig. 6, in step S201, for each image frame in a video, an image feature of a current image frame, an image feature related to a next image frame of the current image frame, and a first prediction feature of the current image frame are input into a forward enhancement network, resulting in a first enhancement feature of the current image frame and a first prediction feature of a predicted next image frame of the current image frame.
In step S202, the first enhancement feature of the current image frame, the image feature related to the previous image frame of the current image frame, and the second prediction feature of the current image frame are input into the backward enhancement network, so as to obtain the second enhancement feature of the current image frame and the second prediction feature of the previous image frame of the predicted current image frame.
In step S203, a super-resolution image of the current image frame is obtained based on the second enhancement feature of the current image frame and the current image frame.
Wherein the first predicted feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for a previous image frame of the current image frame; the second predicted feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for a next image frame to the current image frame.
As an example, the video processing method according to an exemplary embodiment of the present disclosure may further include: before step S201, a stitching vector corresponding to the current image frame is obtained based on the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame; and inputting the spliced vector corresponding to the current image frame into the feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame and the image features related to the next image frame of the current image frame.
As an example, the current image frame, the inter-frame information between the current image frame and the previous image frame, and the inter-frame information between the current image frame and the next image frame may be stitched to obtain a stitching vector corresponding to the current image frame.
As another example, the current image frame, the previous image frame of the current image frame, and the next image frame of the current image frame may be stitched to obtain a stitching vector corresponding to the current image frame.
As an example, the inter-frame information between the current image frame and its previous image frame may include at least one of: a difference map, an optical flow map, and a homography matrix between the current image frame and its previous image frame; the inter-frame information between the current image frame and its next image frame may include at least one of: a difference map, an optical flow map, and a homography matrix between the current image frame and its next image frame.
As an example, the feature extraction network may include: a fusion network, a first feature network, a second feature network, and a third feature network, wherein the stitching vector corresponding to the current image frame may be input into the fusion network to obtain the fusion vector corresponding to the current image frame; and the fusion vector corresponding to the current image frame may be input into the first feature network, the second feature network, and the third feature network, respectively, to obtain the image feature related to the next image frame of the current image frame output by the first feature network, the image feature related to the previous image frame of the current image frame output by the second feature network, and the image feature of the current image frame output by the third feature network.
As an example, the image characteristics of the current image frame may be: original image features of the current image frame or image features of the current image frame after super resolution;
the image feature associated with the previous image frame of the current image frame may be one of: original image characteristics of an image frame which is the last image frame of the current image frame, image characteristics of an image frame which is the last image frame of the current image frame after super resolution, characteristics of inter-frame information between the current image frame and the last image frame of the current image frame, and characteristics of the inter-frame information between the current image frame and the last image frame of the current image frame after enhancement;
the image feature associated with the next image frame of the current image frame may be one of: original image characteristics of a next image frame of the current image frame, super-resolved image characteristics of the next image frame of the current image frame, characteristics of inter-frame information between the current image frame and the next image frame of the current image frame, and characteristics of inter-frame information between the current image frame and the next image frame of the current image frame after enhancement.
As an example, the image characteristics of the current image frame may be: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame may be: super-resolved image features of a difference image between the current image frame and the previous image frame or super-resolved image features of an optical flow image between the current image frame and the previous image frame; the image features associated with the next image frame of the current image frame may be: super-resolved image features of a difference map between a current image frame and a next image frame or super-resolved image features of an optical flow map between the current image frame and the next image frame.
As an example, the first, second, and third feature networks may be unidirectional circular convolution networks.
As an example, the forward enhancement network may include: a prediction future network and a past-to-current enhancement network.

As an example, the image features of the current image frame and the image features related to the next image frame of the current image frame may be input into the prediction future network to obtain the first prediction feature of the next image frame of the current image frame; and the image features of the current image frame and the first prediction feature of the current image frame may be input into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.

As an example, the backward enhancement network may include: a prediction past network and a future-to-current enhancement network.

As an example, the first enhancement feature of the current image frame and the image features related to the previous image frame of the current image frame may be input into the prediction past network to obtain the second prediction feature of the previous image frame of the current image frame; and the first enhancement feature of the current image frame and the second prediction feature of the current image frame may be input into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.
As an example, the prediction future network may obtain alignment information between the image features related to the next image frame of the current image frame and the image features of the current image frame, and predict the first prediction feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the next image frame of the current image frame.
As an example, when the image features of the current image frame are original image features of the current image frame, the video super-resolution model further includes: the super resolution network, wherein step S203 may include: inputting the second enhancement characteristic of the current image frame into the super-resolution network to obtain a high-resolution detailed image of the current image frame; up-sampling the current image frame to obtain a high-resolution structural image of the current image frame; and performing superposition processing on the high-resolution detailed image of the current image frame and the high-resolution structural image of the current image frame to obtain a super-resolution image of the current image frame.
As an example, when the image feature of the current image frame is the super-resolved image feature of the current image frame, step S203 may include: up-sampling the current image frame to obtain a high-resolution structural image of the current image frame; and performing superposition processing on the second enhancement feature of the current image frame and the high-resolution structural image of the current image frame to obtain a super-resolution image of the current image frame.
As an example, the feature extraction network, the forward enhancement network, and the backward enhancement network may be trained using the training method as described in the above exemplary embodiments.
Specific processes in the video processing method according to the exemplary embodiment of the present disclosure have been described in detail in the above-described embodiments of the training method of the related video super-resolution model, and will not be described in detail herein.
Fig. 7 shows a block diagram of a training apparatus of a video super-resolution model according to an exemplary embodiment of the present disclosure. The video super-resolution model comprises: a forward enhancement network and a backward enhancement network.
As shown in fig. 7, the training apparatus 10 of the video super-resolution model according to the exemplary embodiment of the present disclosure includes: a training sample acquisition unit 101, a forward enhancement unit 102, a backward enhancement unit 103, a super-resolution result acquisition unit 104, a loss function determination unit 105, and a training unit 106.
Specifically, the training sample acquisition unit 101 is configured to acquire training samples, wherein the training samples include: a training video having a plurality of image frames, and a high resolution image for each image frame.
The forward enhancement unit 102 is configured to input, for each image frame in the training video, an image feature of a current image frame, an image feature related to a next image frame of the current image frame, and a first prediction feature of the current image frame into the forward enhancement network, resulting in a first enhancement feature of the current image frame and a first prediction feature of a predicted next image frame of the current image frame.
The backward enhancement unit 103 is configured to input a first enhancement feature of the current image frame, an image feature related to a previous image frame of the current image frame, and a second prediction feature of the current image frame into the backward enhancement network, resulting in a second enhancement feature of the current image frame and a second prediction feature of a previous image frame of the predicted current image frame.
The super-resolution image acquisition unit 104 is configured to obtain a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame.
The loss function determination unit 105 is configured to determine an objective loss function of the video super-resolution model based on the super-resolution image of each image frame and its high-resolution image.
The training unit 106 is configured to train the video super-resolution model by adjusting parameters of the forward enhancement network and the backward enhancement network according to the target loss function.
Wherein the first predicted feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for a previous image frame of the current image frame; the second predicted feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for a next image frame to the current image frame.
As an example, the image characteristics of the current image frame may be: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame may be: super-resolved image features of the previous image frame of the current image frame or features of the current image frame and the previous image frame after the inter-frame information between the current image frame and the previous image frame is enhanced; the image features associated with the next image frame of the current image frame may be: super-resolved image features of the next image frame of the current image frame or features of the current image frame and the next image frame after the inter-frame information between the current image frame and the next image frame is enhanced.
As an example, the image characteristics of the current image frame may be: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame may be: super-resolved image features of a difference image between the current image frame and the previous image frame or super-resolved image features of an optical flow image between the current image frame and the previous image frame; the image features associated with the next image frame of the current image frame may be: super-resolved image features of a difference map between a current image frame and a next image frame or super-resolved image features of an optical flow map between the current image frame and the next image frame.
As an example, the video super-resolution model may further include: a feature extraction network, wherein the training device 10 may further comprise: a stitching vector acquisition unit (not shown) and a feature extraction unit (not shown), wherein the stitching vector acquisition unit is configured to obtain a stitching vector corresponding to the current image frame based on the current image frame, an image frame previous to the current image frame, and an image frame next to the current image frame; the feature extraction unit is configured to input the stitching vector corresponding to the current image frame into the feature extraction network to obtain an image feature of the current image frame, an image feature related to a previous image frame of the current image frame, and an image feature related to a next image frame of the current image frame, wherein the training unit 106 may be configured to train the video super-resolution model by adjusting parameters of the feature extraction network, the forward enhancement network, and the backward enhancement network according to the target loss function.
As an example, the stitching vector obtaining unit may be configured to stitch the current image frame, the inter-frame information between the current image frame and the previous image frame, and the inter-frame information between the current image frame and the next image frame, to obtain a stitching vector corresponding to the current image frame; alternatively, the stitching vector obtaining unit may be configured to stitch the current image frame, an image frame previous to the current image frame, and an image frame next to the current image frame to obtain the stitching vector corresponding to the current image frame.
As an example, the feature extraction network may include: a fusion network, a first feature network, a second feature network, and a third feature network, wherein the feature extraction unit may be configured to input the stitching vector corresponding to the current image frame into the fusion network to obtain the fusion vector corresponding to the current image frame; and to input the fusion vector corresponding to the current image frame into the first feature network, the second feature network, and the third feature network, respectively, to obtain the image feature related to the next image frame of the current image frame output by the first feature network, the image feature related to the previous image frame of the current image frame output by the second feature network, and the image feature of the current image frame output by the third feature network.
As an example, the first, second, and third feature networks may be unidirectional circular convolution networks.
As an example, the forward enhancement network may include: predicting a future network and a past pair of current enhancement networks, wherein the forward enhancement unit 102 may be configured to input image features of a current image frame and image features related to a next image frame of the current image frame into the predicted future network, resulting in a first predicted feature of a next image frame of the predicted current image frame; and inputting the image characteristics of the current image frame and the first prediction characteristics of the current image frame into the past current enhancement network to obtain the first enhancement characteristics of the current image frame.
As an example, the backward enhancement network may include: a predicted past network and a future pair current enhancement network, wherein the backward enhancement unit 103 may be configured to input a first enhancement feature of a current image frame and an image feature related to a previous image frame of the current image frame into the predicted past network, resulting in a second prediction feature of the previous image frame of the predicted current image frame; and inputting the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future opposite current enhancement network to obtain the second enhancement feature of the current image frame.
As an example, the predictive future network may obtain image features related to a next image frame of the current image frame, alignment information between the image features of the current image frame, and predict a first predicted feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the next image frame of the current image frame.
As shown in fig. 8, the video processing apparatus 20 according to an exemplary embodiment of the present disclosure includes: a forward enhancement unit 201, a backward enhancement unit 202, and a super-resolution image acquisition unit 203.
Specifically, the forward enhancement unit 201 is configured to input, for each image frame in the video, an image feature of a current image frame, an image feature related to a next image frame of the current image frame, and a first prediction feature of the current image frame into the forward enhancement network, resulting in a first enhancement feature of the current image frame and a first prediction feature of a predicted next image frame of the current image frame.
The backward enhancement unit 202 is configured to input the first enhancement feature of the current image frame, the image feature related to the previous image frame of the current image frame, and the second prediction feature of the current image frame into the backward enhancement network, resulting in the second enhancement feature of the current image frame and the second prediction feature of the previous image frame of the predicted current image frame.
The super-resolution image acquisition unit 203 is configured to obtain a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame.
Wherein the first predicted feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for a previous image frame of the current image frame; the second predicted feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for a next image frame to the current image frame.
As an example, the image characteristics of the current image frame may be: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame may be: super-resolved image features of the previous image frame of the current image frame or features of the current image frame and the previous image frame after the inter-frame information between the current image frame and the previous image frame is enhanced; the image features associated with the next image frame of the current image frame may be: super-resolved image features of the next image frame of the current image frame or features of the current image frame and the next image frame after the inter-frame information between the current image frame and the next image frame is enhanced.
As an example, the image characteristics of the current image frame may be: image characteristics of the current image frame after super resolution; the image features associated with the previous image frame of the current image frame may be: super-resolved image features of a difference image between the current image frame and the previous image frame or super-resolved image features of an optical flow image between the current image frame and the previous image frame; the image features associated with the next image frame of the current image frame may be: super-resolved image features of a difference map between a current image frame and a next image frame or super-resolved image features of an optical flow map between the current image frame and the next image frame.
As an example, the video processing device 20 may further include: a stitching vector acquisition unit (not shown) and a feature extraction unit (not shown), wherein the stitching vector acquisition unit is configured to obtain a stitching vector corresponding to the current image frame based on the current image frame, an image frame previous to the current image frame, and an image frame next to the current image frame; the feature extraction unit is configured to input a stitching vector corresponding to the current image frame into the feature extraction network to obtain image features of the current image frame, image features related to a previous image frame of the current image frame and image features related to a next image frame of the current image frame.
As an example, the stitching vector obtaining unit may be configured to stitch the current image frame, the inter-frame information between the current image frame and the previous image frame, and the inter-frame information between the current image frame and the next image frame, to obtain a stitching vector corresponding to the current image frame; alternatively, the stitching vector obtaining unit may be configured to stitch the current image frame, an image frame previous to the current image frame, and an image frame next to the current image frame to obtain the stitching vector corresponding to the current image frame.
As an example, the feature extraction network may include: a fusion network, a first feature network, a second feature network, and a third feature network, wherein the feature extraction unit may be configured to input the stitching vector corresponding to the current image frame into the fusion network to obtain the fusion vector corresponding to the current image frame; and to input the fusion vector corresponding to the current image frame into the first feature network, the second feature network, and the third feature network, respectively, to obtain the image feature related to the next image frame of the current image frame output by the first feature network, the image feature related to the previous image frame of the current image frame output by the second feature network, and the image feature of the current image frame output by the third feature network.
As an example, the first, second, and third feature networks may be unidirectional circular convolution networks.
As an example, the forward enhancement network may include: predicting a future network and a past pair of current enhancement networks, wherein the forward enhancement unit 201 may be configured to input image features of a current image frame and image features related to a next image frame of the current image frame into the predicting future network, resulting in a first prediction feature of a next image frame of the predicted current image frame; and inputting the image characteristics of the current image frame and the first prediction characteristics of the current image frame into the past current enhancement network to obtain the first enhancement characteristics of the current image frame.
As an example, the backward enhancement network may include: a predicted past network and a future pair current enhancement network, wherein the backward enhancement unit 202 may be configured to input a first enhancement feature of a current image frame and an image feature related to a previous image frame of the current image frame into the predicted past network, resulting in a second prediction feature of the previous image frame of the predicted current image frame; and inputting the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future opposite current enhancement network to obtain the second enhancement feature of the current image frame.
As an example, the predictive future network may obtain image features related to a next image frame of the current image frame, alignment information between the image features of the current image frame, and predict a first predicted feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features related to the next image frame of the current image frame.
The specific manner in which the respective units perform the operations in the apparatus of the above embodiments has been described in detail in relation to the embodiments of the method, and will not be described in detail here.
Further, it should be understood that the various units in the training device 10 and the video processing device 20 of the video super-resolution model according to the exemplary embodiments of the present disclosure may be implemented as hardware components and/or software components. The individual units may be implemented, for example, using a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), depending on the processing performed by the individual units as defined.
Fig. 9 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure. Referring to fig. 9, the electronic device 30 includes: at least one memory 301 and at least one processor 302, the at least one memory 301 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 302, perform the training method and/or the video processing method of the video super-resolution model as described in the above exemplary embodiments.
By way of example, electronic device 30 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the electronic device 30 is not necessarily a single electronic device, but may be any apparatus or a collection of circuits capable of executing the above-described instructions (or instruction sets) individually or in combination. The electronic device 30 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
In electronic device 30, processor 302 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 302 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 302 may execute instructions or code stored in the memory 301, wherein the memory 301 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 301 may be integrated with the processor 302, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory 301 may include a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 301 and the processor 302 may be operatively coupled or may communicate with each other, for example, through an I/O port, network connection, etc., such that the processor 302 is able to read files stored in the memory.
In addition, the electronic device 30 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 30 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, a computer readable storage medium storing instructions may also be provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the training method and/or the video processing method of the video super-resolution model as described in the above exemplary embodiments. Examples of the computer readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD + R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD + R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, blu-ray or optical disk storage, hard Disk Drives (HDD), solid State Disks (SSD), card memory (such as multimedia cards, secure Digital (SD) cards or ultra-fast digital (XD) cards), magnetic tape, floppy disks, magneto-optical data storage, hard disks, solid state disks, and any other means configured to store computer programs and any associated data, data files and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files and data structures to a processor or computer to enable the processor or computer to execute the programs. The computer programs in the computer readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, server, etc., and further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, the instructions in which are executable by at least one processor to perform the training method and/or the video processing method of the video super-resolution model as described in the above exemplary embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (43)

1. A video processing method, comprising:
inputting the image characteristics of the current image frame, the image characteristics related to the next image frame of the current image frame and the first prediction characteristics of the current image frame into a forward enhancement network for each image frame in the video to obtain the first enhancement characteristics of the current image frame and the first prediction characteristics of the next image frame of the predicted current image frame;
Inputting the first enhancement feature of the current image frame, the image feature related to the previous image frame of the current image frame and the second prediction feature of the current image frame into a backward enhancement network to obtain the second enhancement feature of the current image frame and the second prediction feature of the previous image frame of the predicted current image frame;
obtaining a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame;
wherein the first predicted feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for a previous image frame of the current image frame; the second predicted feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for a next image frame to the current image frame.
2. The video processing method according to claim 1, wherein the image characteristics of the current image frame are: image characteristics of the current image frame after super resolution;
the image features associated with the previous image frame of the current image frame are: super-resolved image features of the previous image frame of the current image frame or features of the current image frame and the previous image frame after the inter-frame information between the current image frame and the previous image frame is enhanced;
The image features associated with the next image frame to the current image frame are: super-resolved image features of the next image frame of the current image frame or features of the current image frame and the next image frame after the inter-frame information between the current image frame and the next image frame is enhanced.
3. The video processing method according to claim 2, wherein the image characteristics of the current image frame are: image characteristics of the current image frame after super resolution;
the image features associated with the previous image frame of the current image frame are: super-resolved image features of a difference image between the current image frame and the previous image frame or super-resolved image features of an optical flow image between the current image frame and the previous image frame;
the image features associated with the next image frame to the current image frame are: super-resolved image features of a difference map between a current image frame and a next image frame or super-resolved image features of an optical flow map between the current image frame and the next image frame.
4. The video processing method according to claim 1, characterized in that the video processing method further comprises:
based on the current image frame, the previous image frame of the current image frame and the next image frame of the current image frame, a splicing vector corresponding to the current image frame is obtained;
And inputting the spliced vector corresponding to the current image frame into a feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame and the image features related to the next image frame of the current image frame.
5. The video processing method according to claim 4, wherein the step of obtaining the stitching vector corresponding to the current image frame based on the current image frame, the image frame previous to the current image frame, and the image frame next to the current image frame comprises:
splicing the current image frame, the inter-frame information between the current image frame and the previous image frame and the inter-frame information between the current image frame and the next image frame to obtain a splicing vector corresponding to the current image frame;
or, the current image frame, the image frame which is the last image frame of the current image frame and the image frame which is the next image frame of the current image frame are spliced to obtain the splicing vector corresponding to the current image frame.
6. The video processing method of claim 4, wherein the feature extraction network comprises: a converged network, a first feature network, a second feature network, and a third feature network,
The step of inputting the stitching vector corresponding to the current image frame into the feature extraction network to obtain the image features of the current image frame, the image features related to the previous image frame of the current image frame and the image features related to the next image frame of the current image frame comprises the following steps:
inputting the spliced vector corresponding to the current image frame into the fusion network to obtain the fusion vector corresponding to the current image frame;
and respectively inputting the fusion vector corresponding to the current image frame into the first feature network, the second feature network and the third feature network to obtain the image feature which is output by the first feature network and is related to the next image frame of the current image frame, the image feature which is output by the second feature network and is related to the last image frame of the current image frame, and the image feature which is output by the third feature network.
7. The video processing method of claim 6, wherein the first feature network, the second feature network, and the third feature network are unidirectional circular convolution networks.
8. The video processing method of claim 1, wherein the forward enhancement network comprises: predicting future networks and past versus current enhanced networks,
The step of inputting the image feature of the current image frame, the image feature related to the next image frame of the current image frame and the first prediction feature of the current image frame into the forward enhancement network to obtain the first enhancement feature of the current image frame and the first prediction feature of the next image frame of the predicted current image frame comprises the following steps:
inputting the image characteristics of the current image frame and the image characteristics related to the next image frame of the current image frame into the prediction future network to obtain the first prediction characteristics of the next image frame of the predicted current image frame;
and inputting the image characteristics of the current image frame and the first prediction characteristics of the current image frame into the past current enhancement network to obtain the first enhancement characteristics of the current image frame.
9. The video processing method of claim 1, wherein the backward enhancement network comprises: predicting past networks and future versus current enhanced networks,
the step of inputting the first enhancement feature of the current image frame, the image feature related to the previous image frame of the current image frame and the second prediction feature of the current image frame into the backward enhancement network to obtain the second enhancement feature of the current image frame and the second prediction feature of the previous image frame of the predicted current image frame comprises the following steps:
Inputting the first enhancement feature of the current image frame and the image feature related to the previous image frame of the current image frame into a prediction past network to obtain a second prediction feature of the previous image frame of the predicted current image frame;
and inputting the first enhancement characteristic of the current image frame and the second prediction characteristic of the current image frame into the future opposite current enhancement network to obtain the second enhancement characteristic of the current image frame.
10. The video processing method of claim 8, wherein the predictive future network obtains image features associated with a next image frame of a current image frame, alignment information between the image features of the current image frame, and predicts a first predicted feature of the next image frame of the current image frame based on the alignment information, the image features of the current image frame, and the image features associated with the next image frame of the current image frame.
11. A training method of a video super-resolution model, wherein the video super-resolution model comprises: a forward enhancement network and a backward enhancement network, and the training method comprises:
obtaining a training sample, wherein the training sample comprises: a training video having a plurality of image frames and a high-resolution image for each image frame;
for each image frame in the training video, performing the following processing:
inputting the image feature of the current image frame, the image feature related to the next image frame of the current image frame and the first prediction feature of the current image frame into the forward enhancement network to obtain the first enhancement feature of the current image frame and the first prediction feature of the next image frame of the current image frame;
inputting the first enhancement feature of the current image frame, the image feature related to the previous image frame of the current image frame and the second prediction feature of the current image frame into the backward enhancement network to obtain the second enhancement feature of the current image frame and the second prediction feature of the previous image frame of the current image frame;
obtaining a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame;
determining a target loss function of the video super-resolution model based on the super-resolution image of each image frame and the high-resolution image thereof;
training the video super-resolution model by adjusting parameters of the forward enhancement network and the backward enhancement network according to the target loss function;
wherein the first prediction feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for the previous image frame of the current image frame; the second prediction feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for the next image frame of the current image frame.
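Claim 11 can be read as one forward sweep over the frames followed by one backward sweep, so that the prediction feature produced at one frame is available when its neighbour is processed. The sketch below illustrates that reading. It assumes PyTorch, an L1 reconstruction loss, zero initialisation of the prediction features at the sequence boundaries, and placeholder model methods (extract_features, forward_enhance, backward_enhance, reconstruct); none of these specifics are stated in the claim.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, frames_lr, frames_hr):
    """One illustrative training step over a clip.

    frames_lr / frames_hr: lists of low-/high-resolution tensors (N, C, H, W), one per frame.
    The model methods used here are placeholder names, not the patent's API.
    """
    num_frames = len(frames_lr)
    # feats[t] = (feat_cur, feat_prev_related, feat_next_related) for frame t
    feats = [model.extract_features(frames_lr, t) for t in range(num_frames)]

    # Forward sweep: propagate the first prediction feature from past to future.
    first_enh = [None] * num_frames
    first_pred = torch.zeros_like(feats[0][0])          # boundary initialisation (assumption)
    for t in range(num_frames):
        feat_cur, _, feat_next_rel = feats[t]
        first_enh[t], first_pred = model.forward_enhance(feat_cur, feat_next_rel, first_pred)

    # Backward sweep: propagate the second prediction feature from future to past,
    # reconstruct each frame and accumulate the target loss.
    loss = 0.0
    second_pred = torch.zeros_like(feats[0][0])
    for t in reversed(range(num_frames)):
        _, feat_prev_rel, _ = feats[t]
        second_enh, second_pred = model.backward_enhance(first_enh[t], feat_prev_rel, second_pred)
        sr = model.reconstruct(second_enh, frames_lr[t])
        loss = loss + F.l1_loss(sr, frames_hr[t])        # assumed reconstruction loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```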
12. The training method of claim 11, wherein the image feature of the current image frame is: a super-resolved image feature of the current image frame;
the image feature related to the previous image frame of the current image frame is: a super-resolved image feature of the previous image frame of the current image frame, or a feature of the current image frame and the previous image frame obtained after the inter-frame information between the current image frame and the previous image frame is enhanced;
the image feature related to the next image frame of the current image frame is: a super-resolved image feature of the next image frame of the current image frame, or a feature of the current image frame and the next image frame obtained after the inter-frame information between the current image frame and the next image frame is enhanced.
13. The training method of claim 12, wherein the image feature of the current image frame is: a super-resolved image feature of the current image frame;
the image feature related to the previous image frame of the current image frame is: a super-resolved image feature of a difference map between the current image frame and the previous image frame, or a super-resolved image feature of an optical-flow map between the current image frame and the previous image frame;
the image feature related to the next image frame of the current image frame is: a super-resolved image feature of a difference map between the current image frame and the next image frame, or a super-resolved image feature of an optical-flow map between the current image frame and the next image frame.
14. The training method of claim 11, wherein the video super-resolution model further comprises: a feature extraction network,
wherein the training method further comprises:
obtaining a stitching vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame and the next image frame of the current image frame;
inputting the stitching vector corresponding to the current image frame into the feature extraction network to obtain the image feature of the current image frame, the image feature related to the previous image frame of the current image frame and the image feature related to the next image frame of the current image frame,
wherein training the video super-resolution model by adjusting parameters of the forward enhancement network and the backward enhancement network according to the target loss function comprises: training the video super-resolution model by adjusting parameters of the feature extraction network, the forward enhancement network and the backward enhancement network according to the target loss function.
15. The training method of claim 14, wherein the step of obtaining the stitching vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame and the next image frame of the current image frame comprises:
stitching the current image frame, the inter-frame information between the current image frame and the previous image frame, and the inter-frame information between the current image frame and the next image frame to obtain the stitching vector corresponding to the current image frame;
or stitching the current image frame, the previous image frame of the current image frame and the next image frame of the current image frame to obtain the stitching vector corresponding to the current image frame.
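Interpreting the "stitching" of claim 15 as channel-wise concatenation (an assumption consistent with common practice, not something the claim states), both variants reduce to a single concatenation:

```python
import torch


def stitching_vector(frame_cur, prev_info, next_info):
    """Concatenate along the channel axis; prev_info / next_info may be the neighbouring
    frames themselves or the inter-frame information of claim 15 (illustrative)."""
    return torch.cat([prev_info, frame_cur, next_info], dim=1)
```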
16. The training method of claim 14, wherein the feature extraction network comprises: a fusion network, a first feature network, a second feature network and a third feature network,
wherein the step of inputting the stitching vector corresponding to the current image frame into the feature extraction network to obtain the image feature of the current image frame, the image feature related to the previous image frame of the current image frame and the image feature related to the next image frame of the current image frame comprises:
inputting the stitching vector corresponding to the current image frame into the fusion network to obtain the fusion vector corresponding to the current image frame;
and inputting the fusion vector corresponding to the current image frame into the first feature network, the second feature network and the third feature network respectively, to obtain the image feature related to the next image frame of the current image frame output by the first feature network, the image feature related to the previous image frame of the current image frame output by the second feature network, and the image feature of the current image frame output by the third feature network.
17. The training method of claim 16, wherein the first feature network, the second feature network and the third feature network are unidirectional recurrent convolutional networks.
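Claims 16 and 17 describe one fusion network feeding three unidirectional recurrent convolutional branches. A compact sketch of that layout follows; the channel sizes, the single-convolution fusion layer, the nine input channels (three RGB frames stitched channel-wise) and the way the hidden state is carried from frame to frame are all assumptions.

```python
import torch
import torch.nn as nn


class RecurrentFeatureNet(nn.Module):
    """One unidirectional recurrent convolutional branch: its hidden state is carried
    from one frame to the next in a single temporal direction (illustrative)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.cell = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, fusion_vec, hidden):
        out = self.cell(torch.cat([fusion_vec, hidden], dim=1))
        return out, out          # the output doubles as the next hidden state


class FeatureExtractionNet(nn.Module):
    """Claim 16 layout: a fusion network feeding three feature branches."""

    def __init__(self, in_channels: int = 9, channels: int = 64):
        super().__init__()
        self.fusion = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.branch_next = RecurrentFeatureNet(channels)   # feature related to frame t+1
        self.branch_prev = RecurrentFeatureNet(channels)   # feature related to frame t-1
        self.branch_cur = RecurrentFeatureNet(channels)    # feature of frame t

    def forward(self, stitch_vec, hiddens):
        fusion_vec = self.fusion(stitch_vec)
        feat_next, h_next = self.branch_next(fusion_vec, hiddens[0])
        feat_prev, h_prev = self.branch_prev(fusion_vec, hiddens[1])
        feat_cur, h_cur = self.branch_cur(fusion_vec, hiddens[2])
        # features returned in claim order: first, second, third network outputs
        return (feat_next, feat_prev, feat_cur), (h_next, h_prev, h_cur)
```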
18. The training method of claim 11, wherein the forward enhancement network comprises: a future prediction network and a past-to-current enhancement network,
wherein the step of inputting the image feature of the current image frame, the image feature related to the next image frame of the current image frame and the first prediction feature of the current image frame into the forward enhancement network to obtain the first enhancement feature of the current image frame and the first prediction feature of the next image frame of the current image frame comprises:
inputting the image feature of the current image frame and the image feature related to the next image frame of the current image frame into the future prediction network to obtain the first prediction feature of the next image frame of the current image frame;
and inputting the image feature of the current image frame and the first prediction feature of the current image frame into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.
19. The training method of claim 11, wherein the backward enhancement network comprises: a past prediction network and a future-to-current enhancement network,
wherein the step of inputting the first enhancement feature of the current image frame, the image feature related to the previous image frame of the current image frame and the second prediction feature of the current image frame into the backward enhancement network to obtain the second enhancement feature of the current image frame and the second prediction feature of the previous image frame of the current image frame comprises:
inputting the first enhancement feature of the current image frame and the image feature related to the previous image frame of the current image frame into the past prediction network to obtain the second prediction feature of the previous image frame of the current image frame;
and inputting the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.
20. The training method of claim 18, wherein the future prediction network obtains alignment information between the image feature related to the next image frame of the current image frame and the image feature of the current image frame, and predicts the first prediction feature of the next image frame of the current image frame based on the alignment information, the image feature of the current image frame and the image feature related to the next image frame of the current image frame.
21. A video processing apparatus, comprising:
a forward enhancement unit configured to, for each image frame in the video, input the image feature of the current image frame, the image feature related to the next image frame of the current image frame and the first prediction feature of the current image frame into a forward enhancement network to obtain the first enhancement feature of the current image frame and the first prediction feature of the next image frame of the current image frame;
a backward enhancement unit configured to input the first enhancement feature of the current image frame, the image feature related to the previous image frame of the current image frame and the second prediction feature of the current image frame into a backward enhancement network to obtain the second enhancement feature of the current image frame and the second prediction feature of the previous image frame of the current image frame;
a super-resolution image acquisition unit configured to obtain a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame;
wherein the first prediction feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for the previous image frame of the current image frame; the second prediction feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for the next image frame of the current image frame.
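Claim 21 leaves open how the super-resolution image is obtained from the second enhancement feature and the current frame. A common choice is to upsample the enhanced feature with a pixel-shuffle head and add it to a bilinearly upscaled copy of the current frame; the sketch below assumes exactly that and is not required by the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Reconstructor(nn.Module):
    """Turns the second enhancement feature plus the current low-resolution frame
    into a super-resolution image (illustrative residual-upsampling head)."""

    def __init__(self, channels: int = 64, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, second_enh, frame_lr):
        base = F.interpolate(frame_lr, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)
        return base + self.upsample(second_enh)   # residual on top of a plain upscale (assumption)
```

As a shape check under this assumption: with a 4x scale, a (1, 3, 64, 64) low-resolution frame and a (1, 64, 64, 64) second enhancement feature yield a (1, 3, 256, 256) super-resolution image.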
22. The video processing apparatus of claim 21, wherein the image feature of the current image frame is: a super-resolved image feature of the current image frame;
the image feature related to the previous image frame of the current image frame is: a super-resolved image feature of the previous image frame of the current image frame, or a feature of the current image frame and the previous image frame obtained after the inter-frame information between the current image frame and the previous image frame is enhanced;
the image feature related to the next image frame of the current image frame is: a super-resolved image feature of the next image frame of the current image frame, or a feature of the current image frame and the next image frame obtained after the inter-frame information between the current image frame and the next image frame is enhanced.
23. The video processing apparatus of claim 22, wherein the image feature of the current image frame is: a super-resolved image feature of the current image frame;
the image feature related to the previous image frame of the current image frame is: a super-resolved image feature of a difference map between the current image frame and the previous image frame, or a super-resolved image feature of an optical-flow map between the current image frame and the previous image frame;
the image feature related to the next image frame of the current image frame is: a super-resolved image feature of a difference map between the current image frame and the next image frame, or a super-resolved image feature of an optical-flow map between the current image frame and the next image frame.
24. The video processing apparatus of claim 21, wherein the video processing apparatus further comprises:
a stitching vector acquisition unit configured to obtain a stitching vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame and the next image frame of the current image frame;
a feature extraction unit configured to input the stitching vector corresponding to the current image frame into a feature extraction network to obtain the image feature of the current image frame, the image feature related to the previous image frame of the current image frame and the image feature related to the next image frame of the current image frame.
25. The video processing apparatus of claim 24, wherein the stitching vector acquisition unit is configured to stitch the current image frame, the inter-frame information between the current image frame and the previous image frame, and the inter-frame information between the current image frame and the next image frame to obtain the stitching vector corresponding to the current image frame;
or the stitching vector acquisition unit is configured to stitch the current image frame, the previous image frame of the current image frame and the next image frame of the current image frame to obtain the stitching vector corresponding to the current image frame.
26. The video processing apparatus of claim 24, wherein the feature extraction network comprises: a fusion network, a first feature network, a second feature network and a third feature network,
wherein the feature extraction unit is configured to input the stitching vector corresponding to the current image frame into the fusion network to obtain the fusion vector corresponding to the current image frame; and to input the fusion vector corresponding to the current image frame into the first feature network, the second feature network and the third feature network respectively, to obtain the image feature related to the next image frame of the current image frame output by the first feature network, the image feature related to the previous image frame of the current image frame output by the second feature network, and the image feature of the current image frame output by the third feature network.
27. The video processing apparatus of claim 26, wherein the first feature network, the second feature network and the third feature network are unidirectional recurrent convolutional networks.
28. The video processing apparatus of claim 21, wherein the forward enhancement network comprises: a future prediction network and a past-to-current enhancement network,
wherein the forward enhancement unit is configured to input the image feature of the current image frame and the image feature related to the next image frame of the current image frame into the future prediction network to obtain the first prediction feature of the next image frame of the current image frame; and to input the image feature of the current image frame and the first prediction feature of the current image frame into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.
29. The video processing apparatus of claim 21, wherein the backward enhancement network comprises: a past prediction network and a future-to-current enhancement network,
wherein the backward enhancement unit is configured to input the first enhancement feature of the current image frame and the image feature related to the previous image frame of the current image frame into the past prediction network to obtain the second prediction feature of the previous image frame of the current image frame; and to input the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.
30. The video processing apparatus of claim 28, wherein the future prediction network obtains alignment information between the image feature related to the next image frame of the current image frame and the image feature of the current image frame, and predicts the first prediction feature of the next image frame of the current image frame based on the alignment information, the image feature of the current image frame and the image feature related to the next image frame of the current image frame.
31. A training device for a video super-resolution model, wherein the video super-resolution model comprises: a forward enhancement network and a backward enhancement network, and the training device comprises:
a training sample acquisition unit configured to acquire a training sample, wherein the training sample comprises: a training video having a plurality of image frames and a high-resolution image for each image frame;
a forward enhancement unit configured to, for each image frame in the training video, input the image feature of the current image frame, the image feature related to the next image frame of the current image frame and the first prediction feature of the current image frame into the forward enhancement network to obtain the first enhancement feature of the current image frame and the first prediction feature of the next image frame of the current image frame;
a backward enhancement unit configured to input the first enhancement feature of the current image frame, the image feature related to the previous image frame of the current image frame and the second prediction feature of the current image frame into the backward enhancement network to obtain the second enhancement feature of the current image frame and the second prediction feature of the previous image frame of the current image frame;
a super-resolution image acquisition unit configured to obtain a super-resolution image of the current image frame based on the second enhancement feature of the current image frame and the current image frame;
a loss function determining unit configured to determine a target loss function of the video super-resolution model based on the super-resolution image of each image frame and the high-resolution image thereof;
a training unit configured to train the video super-resolution model by adjusting parameters of the forward enhancement network and the backward enhancement network according to the target loss function;
wherein the first prediction feature of the current image frame is an image feature of the current image frame predicted when the forward enhancement network is used for the previous image frame of the current image frame; the second prediction feature of the current image frame is an image feature of the current image frame predicted when the backward enhancement network is used for the next image frame of the current image frame.
32. The training device of claim 31, wherein the image feature of the current image frame is: a super-resolved image feature of the current image frame;
the image feature related to the previous image frame of the current image frame is: a super-resolved image feature of the previous image frame of the current image frame, or a feature of the current image frame and the previous image frame obtained after the inter-frame information between the current image frame and the previous image frame is enhanced;
the image feature related to the next image frame of the current image frame is: a super-resolved image feature of the next image frame of the current image frame, or a feature of the current image frame and the next image frame obtained after the inter-frame information between the current image frame and the next image frame is enhanced.
33. The training device of claim 32, wherein the image feature of the current image frame is: a super-resolved image feature of the current image frame;
the image feature related to the previous image frame of the current image frame is: a super-resolved image feature of a difference map between the current image frame and the previous image frame, or a super-resolved image feature of an optical-flow map between the current image frame and the previous image frame;
the image feature related to the next image frame of the current image frame is: a super-resolved image feature of a difference map between the current image frame and the next image frame, or a super-resolved image feature of an optical-flow map between the current image frame and the next image frame.
34. The training device of claim 31, wherein the video super-resolution model further comprises: a feature extraction network,
wherein the training device further comprises:
a stitching vector acquisition unit configured to obtain a stitching vector corresponding to the current image frame based on the current image frame, the previous image frame of the current image frame and the next image frame of the current image frame;
a feature extraction unit configured to input the stitching vector corresponding to the current image frame into the feature extraction network to obtain the image feature of the current image frame, the image feature related to the previous image frame of the current image frame and the image feature related to the next image frame of the current image frame,
wherein the training unit is configured to train the video super-resolution model by adjusting parameters of the feature extraction network, the forward enhancement network and the backward enhancement network according to the target loss function.
35. The training device of claim 34, wherein the stitching vector acquisition unit is configured to stitch the current image frame, the inter-frame information between the current image frame and the previous image frame, and the inter-frame information between the current image frame and the next image frame to obtain the stitching vector corresponding to the current image frame;
or the stitching vector acquisition unit is configured to stitch the current image frame, the previous image frame of the current image frame and the next image frame of the current image frame to obtain the stitching vector corresponding to the current image frame.
36. The training device of claim 34, wherein the feature extraction network comprises: a fusion network, a first feature network, a second feature network and a third feature network,
wherein the feature extraction unit is configured to input the stitching vector corresponding to the current image frame into the fusion network to obtain the fusion vector corresponding to the current image frame; and to input the fusion vector corresponding to the current image frame into the first feature network, the second feature network and the third feature network respectively, to obtain the image feature related to the next image frame of the current image frame output by the first feature network, the image feature related to the previous image frame of the current image frame output by the second feature network, and the image feature of the current image frame output by the third feature network.
37. The training device of claim 36, wherein the first feature network, the second feature network and the third feature network are unidirectional recurrent convolutional networks.
38. The training device of claim 31, wherein the forward enhancement network comprises: a future prediction network and a past-to-current enhancement network,
wherein the forward enhancement unit is configured to input the image feature of the current image frame and the image feature related to the next image frame of the current image frame into the future prediction network to obtain the first prediction feature of the next image frame of the current image frame; and to input the image feature of the current image frame and the first prediction feature of the current image frame into the past-to-current enhancement network to obtain the first enhancement feature of the current image frame.
39. The training device of claim 31, wherein the backward enhancement network comprises: a past prediction network and a future-to-current enhancement network,
wherein the backward enhancement unit is configured to input the first enhancement feature of the current image frame and the image feature related to the previous image frame of the current image frame into the past prediction network to obtain the second prediction feature of the previous image frame of the current image frame; and to input the first enhancement feature of the current image frame and the second prediction feature of the current image frame into the future-to-current enhancement network to obtain the second enhancement feature of the current image frame.
40. The training device of claim 38, wherein the future prediction network obtains alignment information between the image feature related to the next image frame of the current image frame and the image feature of the current image frame, and predicts the first prediction feature of the next image frame of the current image frame based on the alignment information, the image feature of the current image frame and the image feature related to the next image frame of the current image frame.
41. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the video processing method of any one of claims 1 to 10 and/or the training method of the video super-resolution model of any one of claims 11 to 20.
42. A computer-readable storage medium, characterized in that instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the video processing method of any one of claims 1 to 10 and/or the training method of the video super-resolution model of any one of claims 11 to 20.
43. A computer program product comprising computer instructions which, when executed by at least one processor, implement the video processing method of any one of claims 1 to 10 and/or the training method of the video super-resolution model of any one of claims 11 to 20.
CN202110933607.5A 2021-08-14 2021-08-14 Training method of video super-resolution model, video processing method and corresponding equipment Active CN113592719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110933607.5A CN113592719B (en) 2021-08-14 2021-08-14 Training method of video super-resolution model, video processing method and corresponding equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110933607.5A CN113592719B (en) 2021-08-14 2021-08-14 Training method of video super-resolution model, video processing method and corresponding equipment

Publications (2)

Publication Number Publication Date
CN113592719A CN113592719A (en) 2021-11-02
CN113592719B (en) 2023-11-28

Family

ID=78258224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110933607.5A Active CN113592719B (en) 2021-08-14 2021-08-14 Training method of video super-resolution model, video processing method and corresponding equipment

Country Status (1)

Country Link
CN (1) CN113592719B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236889A (en) * 2010-05-18 2011-11-09 王洪剑 Super-resolution reconfiguration method based on multiframe motion estimation and merging
CN108307193A (en) * 2018-02-08 2018-07-20 北京航空航天大学 A kind of the multiframe quality enhancement method and device of lossy compression video
CN109978756A (en) * 2019-03-18 2019-07-05 腾讯科技(深圳)有限公司 Object detection method, system, device, storage medium and computer equipment
CN110120011A (en) * 2019-05-07 2019-08-13 电子科技大学 A kind of video super resolution based on convolutional neural networks and mixed-resolution
CN110177282A (en) * 2019-05-10 2019-08-27 杭州电子科技大学 A kind of inter-frame prediction method based on SRCNN
CN110852944A (en) * 2019-10-12 2020-02-28 天津大学 Multi-frame self-adaptive fusion video super-resolution method based on deep learning
CN111524068A (en) * 2020-04-14 2020-08-11 长安大学 Variable-length input super-resolution video reconstruction method based on deep learning
CN112365403A (en) * 2020-11-20 2021-02-12 山东大学 Video super-resolution recovery method based on deep learning and adjacent frames
CN112700392A (en) * 2020-12-01 2021-04-23 华南理工大学 Video super-resolution processing method, device and storage medium
CN112927144A (en) * 2019-12-05 2021-06-08 北京迈格威科技有限公司 Image enhancement method, image enhancement device, medium, and electronic apparatus
WO2021147055A1 (en) * 2020-01-22 2021-07-29 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for video anomaly detection using multi-scale image frame prediction network

Also Published As

Publication number Publication date
CN113592719A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant